Next-generation sequencing libraries

ABSTRACT

Provided herein is technology relating to next-generation sequencing and particularly, but not exclusively, to methods and compositions for preparing a next-generation sequencing library comprising short overlapping DNA fragments and using the library to sequence one or more target nucleic acids.

This application is a divisional application of U.S. patent application Ser. No. 16/023,574, filed Jun. 29, 2018, which is a divisional application of U.S. patent application Ser. No. 14/463,498, filed Aug. 19, 2014, now U.S. Pat. No. 10,036,013, issued Jul. 31, 2018, which claims priority to U.S. provisional patent application Ser. No. 61/867,224, filed Aug. 19, 2013, each of which is incorporated herein by reference in its entirety.

FIELD OF INVENTION

Provided herein is technology relating to next-generation sequencing and particularly, but not exclusively, to methods, compositions, kits, and systems for preparing a next-generation sequencing library comprising overlapping DNA fragments and using the library to sequence one or more target nucleic acids.

BACKGROUND

Nucleic acid sequences encode the necessary information for living things to function and reproduce. Determining such sequences is therefore a tool useful in pure research into how and where organisms live, as well as in applied sciences such as drug development. In medicine, sequencing tools are used for diagnosis and to develop treatments for a variety of pathologies, including cancer, infectious disease, heart disease, autoimmune disorders, multiple sclerosis, and obesity. In industry, sequencing is used to design improved enzymatic processes and synthetic organisms. In biology, such tools are used to study the health of ecosystems, for example, and thus have a broad range of utility.

One focus of the sequencing industry has shifted to finding higher throughput and/or lower cost nucleic acid sequencing technologies, sometimes referred to as “next generation” sequencing (NGS) technologies. In making sequencing higher throughput and/or less expensive, the goal is to make the technology more accessible for sequencing. These goals can be reached through using sequencing platforms and methods that provide sample preparation for larger quantities of samples of significant complexity, sequencing larger numbers of complex samples, and/or providing a high volume of information generation and analysis in a short period of time. Various methods, such as, for example, sequencing by synthesis, sequencing by hybridization, and sequencing by ligation are evolving to meet these challenges.

Many next-generation sequencing (NGS) platforms are available for the high-throughput, massively parallel sequencing of nucleic acids. Many of these systems, such as the HiSeq and MiSeq systems produced by Illumina, use a sequencing-by-synthesis (SBS) approach, wherein a nucleotide sequence is determined using base-by-base detection and identification. Using this particular approach, identifying 1 base requires 1 cycle of the SBS chemistry process (which may involve four separate reactions separated by washes).

Currently, these technologies provide a maximum achievable read length of ˜250 bases, which can be extended to ˜400 (2×250 bases with sufficient overlap for assembly) if two high-quality paired-end reads are acquired from the same template and assembled. Each SBS cycle takes approximately 4 minutes to complete; thus, in a paired-end approach to acquire ˜400 bases of sequence information, the 500 cycles of SBS required to produce the two reads of ˜250 bases take approximately 37 hours to complete. In addition, the performance and quality of most cyclic sequencing technologies decrease substantially after determining ˜100 bases, introducing a degree of uncertainty associated with individual sequence reads longer than ˜100 bases and the longer sequence assemblies in which they are used. Due to these quality and time limitations of current NGS platforms, the ever-increasing demands for long, high-quality nucleotide sequences are saturating the output capabilities of the installed base of sequencing apparatuses. Consequently, technologies are needed that provide high-quality sequences of ˜500 bases or more from a much shorter sequencing run-time of several hours rather than several days.

SUMMARY

Some attempts to acquire longer sequences by NGS technology have applied the approach of assembling multiple short reads to produce a longer sequence. For example, the Moleculo technology provided by Illumina initially isolates a single copy of a long (˜10 Kbp) DNA fragment. This long DNA fragment is clonally amplified and subsequently fragmented into smaller pieces of approximately 300-800 bases. Finally, adaptors with barcodes are appended to the smaller pieces using a transposase to generate the sequencing library. A standard SBS protocol is used to acquire ˜300-500 bases of sequence from the target template (2×150 bases or 2×250 bases) and, once the sequences are generated, the barcodes are used to parse and assemble the reads to provide the sequence of the original ˜10 Kbp DNA. Another method involves creation of an overlapping fragment library suitable for an Illumina sequencer, which produces reads of ˜400-460 bases by assembling two ˜250-base reads that overlap by ˜20-50 bases (see, e.g., Lundin, et al. (2013) Scientific Reports 3: 1186). This overlapping library is constructed mainly by tagging fragments with specific adaptor sequences, followed by a digestion step and a precise size selection process.

Accordingly, provided herein is a technology for sequencing that utilizes a relatively short read length (e.g., less than 300 or less than 200 bases, e.g., ˜30-50 bases) to achieve a high-quality, long contiguous sequence comparable or superior to conventional technologies. In contrast to conventional technologies, the technology provided requires only a short period of run-time (e.g., ˜3-4 hours) on a sequencer (e.g., Illumina MiSeq platform), thus dramatically decreasing the time dedicated to use of the sequencing apparatus required to complete a sequencing run. Moreover, the technology results in longer sequences (e.g., ˜500 bp to 1000 bp or more of high quality sequence) than conventional technology. Also, run-time does not increase as a function of the size of the nucleic acid to be sequenced because the short read size (e.g., ˜30-50 bases) remains the same regardless of the size of the nucleic acid to be sequenced.

The technology is not limited to any particular sequencing platform, but is generally applicable and platform independent. For example, in addition to decreases in run-time on Illumina systems, similar time reductions are achieved for sequences acquired using, e.g., Life Technologies Ion Torrent and Qiagen GeneReader systems. In particular, while acquiring a ˜400 base sequence using conventional Ion Torrent sample preparation and sequencing technology requires approximately 4 hours, the technology provided herein reduces that time to approximately 20 to 30 minutes. In some embodiments, the technology is applicable to emulsion PCR-based, bead-based, and non-bead-based methods, and thus finds use in the Life Technologies SOLiD systems and the Qiagen NGS sequencing platforms.

This technology provides high quality sequence in a decreased sequencing time relative to conventional technologies. The technology is platform agnostic and thus is compatible with extant sequencing apparatuses. The technology, in some embodiments, enhances existing NGS platforms by, e.g., increasing the read length of extant platforms and shortening the time to sequence acquisition. Furthermore, an added advantage of the present technology is that it reduces consumption of expensive sequencing reagents and thus can decrease the overall per-base cost of sequencing.

In short, the technology involves producing a set of defined overlapping short sequence library inserts (e.g., less than 300 or less than 200 bases, e.g., ˜30-50 bases) tiled over a region of a nucleic acid to be sequenced and offset from one another by, e.g., 1-20, 1-10, or 1-5 bases (e.g., in some embodiments, by 1 base). After producing the set of sequences using the overlapping libraries, bioinformatic assembly algorithms are used to “stitch” the tiled set of short overlapping sequences together to produce the sequence of the nucleic acid.
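The tiling and stitching idea can be illustrated with a minimal Python sketch. It is not the assembly algorithm of the technology: the function names `tile` and `stitch`, the default read length of 40, and the simplifying assumptions that reads are error-free, arrive in order, and share a fixed known offset are all introduced here for illustration only.

```python
def tile(target, read_len=40, offset=1):
    """Return windows of read_len bases over target, each shifted by offset bases.

    Hypothetical helper: models the tiled short inserts as simple substrings.
    """
    if offset < 1 or read_len <= offset:
        raise ValueError("need 1 <= offset < read_len so adjacent tiles overlap")
    if len(target) < read_len:
        raise ValueError("target is shorter than one read")
    last_start = len(target) - read_len
    return [target[i:i + read_len] for i in range(0, last_start + 1, offset)]


def stitch(reads, offset=1):
    """Rebuild a sequence from ordered, error-free tiles with a known offset."""
    seq = reads[0]
    for read in reads[1:]:
        overlap = len(read) - offset
        if overlap <= 0:
            raise ValueError("reads do not overlap at this offset")
        # Each new tile must agree with the end of the sequence built so far.
        if seq[-overlap:] != read[:overlap]:
            raise ValueError("adjacent reads disagree in their overlap")
        seq += read[overlap:]  # only the final `offset` bases are new
    return seq


target = ("ACGTTGCA" * 8)[:60]          # toy 60-base target
reads = tile(target, read_len=40, offset=1)
assert stitch(reads, offset=1) == target
```

With a 1-base offset, each additional read contributes exactly one new base, so the assembled length grows with the number of tiles while the read length stays fixed.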

First, sequence quality is high because each base in the nucleic acid to be sequenced is sequenced with high coverage (e.g., 10-fold to 1000-fold coverage, e.g., 50-fold to 500-fold coverage) depending on the length of the short sequences acquired and the offset between adjacent tiled sequences. The high sampling rate at each base minimizes or eliminates sequencing errors by providing increased information to the assembly process that determines the consensus identity of each base. In addition, the first bases (e.g., the first ˜20-100 bases) determined in a sequencing run generally have the best quality. Thus, by using these initial bases determined during the first part of each sequencing run (e.g., the first ˜30-50 bases), high quality sequence information is used in the assembly. The technology thus minimizes sequencing errors, especially in applications where long sequence reads are desired that retain phasing and linkage information associated with the reads and assemblies.
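The dependence of coverage on read length and offset can be made concrete with a short Python sketch. It counts only distinct tiles (one copy of each); the hypothetical function `per_base_depth` and the toy 500-base target are assumptions for illustration, and real coverage would be multiplied by the number of copies of each fragment that are sequenced.

```python
def per_base_depth(target_len, read_len, offset):
    """Count how many distinct tiles cover each position of the target."""
    if offset < 1 or read_len < 1 or read_len > target_len:
        raise ValueError("invalid tiling parameters")
    depth = [0] * target_len
    for start in range(0, target_len - read_len + 1, offset):
        for pos in range(start, start + read_len):
            depth[pos] += 1
    return depth


# Interior positions are covered by roughly read_len / offset distinct tiles.
depth_1 = per_base_depth(500, read_len=50, offset=1)
depth_5 = per_base_depth(500, read_len=50, offset=5)
print(depth_1[250], depth_5[250])
```

In this sketch an interior base is covered 50 times at a 1-base offset and 10 times at a 5-base offset, while positions near the ends of the region receive less coverage.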

Second, sequencer time is reduced because determining each short sequence (e.g., ˜30-50 bases) requires only a small number of sequencing cycles (e.g., 1 cycle per base, e.g., ˜30-50 cycles) on the sequencing apparatus. By determining all the short sequences in the set of short sequences in parallel, the sequencing time needed to provide the sequence of the nucleic acid to be sequenced is greatly reduced, e.g., to one-eighth to one-tenth of the time needed by conventional technologies to sequence the same nucleic acid.
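The arithmetic behind the reduction can be sketched in Python as follows. The helper names are hypothetical, the 4-minute cycle time is taken from the approximate figure given in the Background, and setup or turnaround overhead on the instrument is ignored, so the hours printed are rough estimates only.

```python
def run_time_hours(cycles, minutes_per_cycle):
    """Instrument time for a run, ignoring setup and turnaround overhead."""
    return cycles * minutes_per_cycle / 60.0


def cycle_ratio(short_read_len, conventional_cycles):
    """Fraction of conventional cycles needed when one cycle reads one base."""
    return short_read_len / conventional_cycles


MINUTES_PER_CYCLE = 4  # assumed approximate value, not a measured figure

print(f"50-cycle run: {run_time_hours(50, MINUTES_PER_CYCLE):.1f} h")
print(f"50 vs 500 cycles: {cycle_ratio(50, 500):.3f}")
print(f"50 vs 400 cycles: {cycle_ratio(50, 400):.3f}")
```

Because every short sequence is read in parallel, the ratio depends only on the cycle counts and not on the per-cycle duration: 50 cycles is one-tenth of a 500-cycle run and one-eighth of a 400-cycle run.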

This technology for NGS library preparation and sequencing and the subsequent short-read parsing and assembly provides acquisition of more than ˜500 bp (e.g., 600, 700, 800 bp or more) of high-quality contiguous sequence with phase information. The technology finds use, e.g., in sequencing unknown regions starting from a known region, for example, to interrogate structural variants such as gene translocations, e.g., the detection and identification of unknown gene fusion partners. Moreover, the technology enhances existing NGS platforms' sequencing capabilities relative to read length, run time, and cost without any upgrades and/or changes to existing installed hardware and extant sequencing chemistries.

In some embodiments, the technology is related to a method for determining a target nucleotide sequence, the method comprising determining a first nucleotide subsequence of the target nucleotide sequence, said first nucleotide subsequence having a 5′ end at nucleotide x1 of the target nucleotide sequence and having a 3′ end at nucleotide y1 of the target nucleotide sequence; determining a second nucleotide subsequence of the target nucleotide sequence, said second nucleotide subsequence having a 5′ end at nucleotide x2 of the target nucleotide sequence and having a 3′ end at nucleotide y2 of the target nucleotide sequence; assembling the first nucleotide subsequence and the second nucleotide subsequence to provide a consensus sequence for the target nucleotide sequence, wherein x2<y1; and (y1−x1)<100, (y2−x2)<100, and (y2−y1)<5. In some embodiments, the fragments are less than 100 bp, less than 90 bp, less than 80 bp, less than 70 bp, less than 60 bp, less than 55 bp, less than 50 bp, less than 45 bp, less than 40 bp, or less than 35 bp. Accordingly, in some embodiments, (y1−x1)<100, 90, 80, 70, 60, 55, 50, 45, 40, or 35 and (y2−x2)<100, 90, 80, 70, 60, 55, 50, 45, 40, or 35. In some embodiments, the fragments are less than 50 bp; accordingly, in some embodiments, (y1−x1)<50 and (y2−x2)<50.

In some embodiments, the 3′ ends of the fragments differ with respect to the target sequence by less than 4 or less than 3 bases; accordingly, in some embodiments, (y2−y1)<4 or (y2−y1)<3. In some embodiments, the 3′ ends of the fragments differ with respect to the target sequence by 1 base; accordingly, in some embodiments (y2−y1)=1.

In some embodiments, a unique index (a “marker” in some embodiments) is used to associate a fragment with the template nucleic acid from which it was produced. In some embodiments, a unique index is a unique sequence of synthetic nucleotides or a unique sequence of natural nucleotides that allows for easy identification of the target nucleic acid within a complicated collection of oligonucleotides (e.g., fragments) containing various sequences. In certain embodiments, unique index identifiers are attached to nucleic acid fragments prior to attaching adaptor sequences. In some embodiments, unique index identifiers are contained within adaptor sequences such that the unique sequence is contained in the sequencing reads. This ensures that homologous fragments can be detected based upon the unique indices that are attached to each fragment, thus further providing for unambiguous reconstruction of a consensus sequence. Homologous fragments may occur for example by chance due to genomic repeats, two fragments originating from homologous chromosomes, or fragments originating from overlapping locations on the same chromosome. Homologous fragments may also arise from closely related sequences (e.g., closely related gene family members, paralogs, orthologs, ohnologs, xenologs, and/or pseudogenes). Such fragments may be discarded to ensure that long fragment assembly can be computed unambiguously. The markers may be attached as described above for the adaptor sequences. The indices (e.g., markers) may be included in the adaptor sequences.

In some embodiments, the unique index (e.g., index identifier, tag, marker, etc.) is a “barcode”. As used herein, the term “barcode” refers to a known nucleic acid sequence that allows some feature of a nucleic acid with which the barcode is associated to be identified. In some embodiments, the feature of the nucleic acid to be identified is the sample or source from which the nucleic acid is derived. In some embodiments, barcodes are at least 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, or more nucleotides in length. In some embodiments, barcodes are shorter than 10, 9, 8, 7, 6, 5, or 4 nucleotides in length. In some embodiments, barcodes associated with some nucleic acids are of a different length than barcodes associated with other nucleic acids. In general, barcodes are of sufficient length and comprise sequences that are sufficiently different to allow the identification of samples based on barcodes with which they are associated. In some embodiments, a barcode and the sample source with which it is associated can be identified accurately after the mutation, insertion, or deletion of one or more nucleotides in the barcode sequence, such as the mutation, insertion, or deletion of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more nucleotides. In some embodiments, each barcode in a plurality of barcodes differs from every other barcode in the plurality at two or more nucleotide positions, such as at 2, 3, 4, 5, 6, 7, 8, 9, 10, or more positions. In some embodiments, one or more adaptors comprise(s) at least one of a plurality of barcode sequences. In some embodiments, methods of the technology further comprise identifying the sample or source from which a target nucleic acid is derived based on a barcode sequence to which the target nucleic acid is joined. In some embodiments, methods of the technology further comprise identifying the target nucleic acid based on a barcode sequence to which the target nucleic acid is joined. Some embodiments of the method further comprise identifying a source or sample of the target nucleotide sequence by determining a barcode nucleotide sequence. Some embodiments of the method further comprise molecular counting applications (e.g., digital barcode enumeration and/or binning) to determine expression levels or copy number status of desired targets. In general, a barcode may comprise a nucleic acid sequence that when joined to a target nucleic acid serves as an identifier of the sample from which the target polynucleotide was derived.
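The requirement that barcodes differ from one another at two or more positions, and the ability to identify a barcode after a mutation, can be illustrated with a small Python sketch. It is a simplification under stated assumptions: all barcodes have equal length, only substitutions are considered (insertions and deletions are not handled), and the three example barcodes and the function names are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal-length barcodes")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes):
    """Smallest Hamming distance between any two barcodes in the set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


def assign_barcode(observed, barcodes, max_mismatches=1):
    """Return the single barcode within max_mismatches of observed, else None."""
    hits = [b for b in barcodes if hamming(observed, b) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None


BARCODES = ["ACGTAC", "TGCATG", "GATCGA"]   # hypothetical example set
print(min_pairwise_distance(BARCODES))
print(assign_barcode("ACGTAA", BARCODES))   # one substitution in ACGTAC
```

A set whose minimum pairwise distance is d tolerates up to (d − 1) // 2 substitutions per barcode without ambiguity, which is one way to choose `max_mismatches`.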

In some embodiments, the methods provide a sequence of up to 100 bases or, in some embodiments, a sequence of more than 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, or more bases. In some embodiments, the technology provides a sequence of more than 1000 bases, e.g., more than 2000, 2500, 3000, 3500, 4000, 4500, or 5000 or more bases. In some embodiments the consensus sequence comprises up to 100 bases or more, e.g., 200, 300, 400, 500, 600, 700, 800, 900, 1000, or more bases; in some embodiments the consensus sequence comprises more than 1000 bases, e.g., more than 2000, 2500, 3000, 3500, 4000, 4500, or 5000 or more bases.

In some embodiments, an oligonucleotide such as a primer, adaptor, etc. comprises a “universal” sequence. A universal sequence is a known sequence, e.g., for use as a primer or probe binding site using a primer or probe of a known sequence (e.g., complementary to the universal sequence). While a template-specific sequence of a primer, a barcode sequence of a primer, and/or a barcode sequence of an adaptor might differ in embodiments of the technology, e.g., from fragment to fragment, from sample to sample, from source to source, or from region of interest to region of interest, embodiments of the technology provide that a universal sequence is the same from fragment to fragment, from sample to sample, from source to source, or from region of interest to region of interest so that all fragments comprising the universal sequence can be handled and/or treated in a same or similar manner, e.g., amplified, identified, sequenced, isolated, etc., using similar methods or techniques (e.g., using the same primer or probe).

In particular embodiments, a primer is used comprising a universal sequence (e.g., universal sequence A), a barcode sequence, and a template-specific sequence. In particular embodiments, a first adaptor comprising a universal sequence (e.g., universal sequence B) is used and in particular embodiments, a second adaptor comprising a universal sequence (e.g., universal sequence C) is used. Universal sequence A, universal sequence B, and universal sequence C can be any sequence. This nomenclature is used to note that the universal sequence A of a first nucleic acid (e.g., a fragment) comprising universal sequence A is the same as the universal sequence A of a second nucleic acid (e.g., a fragment) comprising universal sequence A, the universal sequence B of a first nucleic acid (e.g., a fragment) comprising universal sequence B is the same as the universal sequence B of a second nucleic acid (e.g., a fragment) comprising universal sequence B, and the universal sequence C of a first nucleic acid (e.g., a fragment) comprising universal sequence C is the same as the universal sequence C of a second nucleic acid (e.g., a fragment) comprising universal sequence C. While universal sequences A, B, and C are generally different in embodiments of the technology, they need not be. Thus, in some embodiments, universal sequences A and B are the same; in some embodiments, universal sequences B and C are the same; in some embodiments, universal sequences A and C are the same; and in some embodiments, universal sequences A, B, and C are the same. In some embodiments, universal sequences A, B, and C are different.

For example, if two regions of interest are to be sequenced (e.g., from the same or different sources or, e.g., from two different regions of the same nucleic acid, chromosome, gene, etc.), two primers may be used, one primer comprising a first template-specific sequence for priming from the first region of interest and a first barcode to associate the first amplified product with the first region of interest and a second primer comprising a second template-specific sequence for priming from the second region of interest and a second barcode to associate the second amplified product with the second region of interest. These two primers, however, in some embodiments, will comprise the same universal sequence (e.g., universal sequence A) for pooling and downstream processing together. Two or more universal sequences may be used and, in general, the number of universal sequences will be less than the number of target-specific sequences and/or barcode sequences for pooling of samples and treatment of pools as a single sample (batch).

Accordingly, in some embodiments, determining the first nucleotide subsequence and the second nucleotide subsequence comprises priming from a universal sequence. In some embodiments determining the first nucleotide subsequence and the second nucleotide subsequence comprises terminating polymerization with a 3′-O-blocked nucleotide analog. For example, in some embodiments determining the first nucleotide subsequence and the second nucleotide subsequence comprises terminating polymerization with a 3′-O-alkynyl nucleotide analog, e.g., in some embodiments determining the first nucleotide subsequence and the second nucleotide subsequence comprises terminating polymerization with a 3′-O-propargyl nucleotide analog. In some embodiments determining the first nucleotide subsequence and the second nucleotide subsequence comprises terminating polymerization with a nucleotide analog comprising a reversible terminator.

The obtained short sequence reads are partitioned according to their barcode (e.g., de-multiplexed) and reads originating from the same samples, sources, regions of interest, etc. are binned together, e.g., saved to separate files or held in an organized data structure that allows binned reads to be identified as such. Then the binned short sequences are assembled into a consensus sequence. Sequence assembly can generally be divided into two broad categories: de novo assembly and reference genome mapping assembly. In de novo assembly, sequence reads are assembled together so that they form a new and previously unknown sequence. In reference genome mapping, sequence reads are assembled against an existing backbone sequence (e.g., a reference sequence, etc.) to build a sequence that is similar but not necessarily identical to the backbone sequence.
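The partitioning and binning step can be sketched in Python with an in-memory data structure. The record layout (a barcode paired with a read) and the function name `bin_reads` are assumptions made for illustration; an actual pipeline would parse these values from the sequencer output and might write each bin to a separate file instead.

```python
from collections import defaultdict


def bin_reads(records):
    """Group reads by barcode; records is an iterable of (barcode, read) pairs."""
    bins = defaultdict(list)
    for barcode, read in records:
        bins[barcode].append(read)
    return dict(bins)


# Hypothetical de-multiplexed records from two regions of interest.
records = [
    ("ACGTAC", "TTGACCA"),
    ("TGCATG", "GGCATTA"),
    ("ACGTAC", "TGACCAG"),
]
for barcode, reads in bin_reads(records).items():
    print(barcode, reads)
```

Each bin can then be passed to the assembly step independently, so reads from different samples or regions of interest are never stitched together.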

Thus, in some embodiments, target nucleic acids corresponding to each region of interest are reconstructed using a de novo assembly. To begin the reconstruction process, short reads are stitched together bioinformatically by finding overlaps and extending them to produce a consensus sequence. In some embodiments the method further comprises mapping the consensus sequence to a reference sequence. Methods of the technology take advantage of sequencing quality scores that represent base calling confidence to reconstruct full length fragments. In addition to de novo assembly, fragments can be used to obtain phasing (assignment to homologous copies of chromosomes) of genomic variants by observing that consensus sequences originate from either one of the chromosomes.
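One way quality scores can inform the consensus is a quality-weighted vote at each position, sketched below in Python. This is an illustrative simplification and not necessarily the method used by the technology: it assumes the reads have already been placed (each has a known start coordinate), that qualities are Phred-like integers, and that summing qualities per base is an acceptable weighting; the function name and example reads are hypothetical.

```python
from collections import defaultdict


def weighted_consensus(placed_reads):
    """Call each position by the base with the largest summed quality.

    placed_reads: iterable of (start, bases, quals), where quals holds one
    Phred-like score per base. Placement is assumed to be known already.
    """
    votes = defaultdict(lambda: defaultdict(float))
    for start, bases, quals in placed_reads:
        if len(bases) != len(quals):
            raise ValueError("each base needs exactly one quality score")
        for i, (base, qual) in enumerate(zip(bases, quals)):
            votes[start + i][base] += qual
    if not votes:
        return ""
    positions = sorted(votes)
    if positions != list(range(positions[0], positions[-1] + 1)):
        raise ValueError("reads leave an uncovered gap")
    return "".join(max(votes[p], key=votes[p].get) for p in positions)


reads = [
    (0, "ACGTA", [30, 30, 30, 30, 30]),
    (1, "CGTAC", [30, 30, 30, 30, 30]),
    (2, "GAACG", [30, 5, 30, 30, 30]),   # low-quality miscall at position 3
]
print(weighted_consensus(reads))
```

In this example the low-quality miscall in the third read is outvoted by the two high-quality reads covering the same position, which mirrors how high per-base coverage suppresses individual read errors.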

In some embodiments, a computer system is implemented for assembly and bioinformatic treatment of sequence information (e.g., identifying barcodes, partitioning, binning, making base calls, determining a consensus identity of each base, stitching reads, assessing quality scores, aligning reads and/or consensus sequences to a reference sequence, etc.). In various embodiments, a computer system includes a bus or other communication mechanism for communicating information and a processor coupled with the bus for processing information. In various embodiments, the computer system includes a memory, which can be a random access memory (RAM) or other dynamic storage device, coupled to the bus, and instructions to be executed by the processor. The memory also can be used for storing temporary variables or other intermediate information during execution of instructions to be executed by the processor. In various embodiments, the computer system further includes a read only memory (ROM) or other static storage device coupled to the bus for storing static information and instructions for the processor. In some embodiments, a storage device, such as a solid state drive (e.g., “flash” memory), a magnetic disk, or an optical disk, is provided and coupled to the bus for storing information and instructions.

In various embodiments, the computer system is coupled via the bus to a display, such as a cathode ray tube (CRT) or liquid crystal display (LCD), for displaying information to a computer user. In some embodiments, an input device, including alphanumeric and other keys, is coupled to the bus for communicating information and command selections to the processor. Another type of user input device is a cursor control, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to the processor and for controlling cursor movement on the display.

In some embodiments, a computer system performs aspects of the present technology. Consistent with certain embodiments of the technology, results are provided by the computer system in response to the processor executing one or more sequences of one or more instructions contained in memory. Such instructions can be read into memory from another computer-readable medium, such as the storage device. Alternatively, hard-wired circuitry can be used in place of or in combination with software instructions to implement the present technology. Thus implementations of the present teachings are not limited to any specific combination of hardware circuitry and software. For example, as described herein, embodiments of the technology comprise the use of storage and transfer of data using “cloud” computing technology, wired (e.g., fiber optic, cable, copper, ADSL, Ethernet, and the like), and/or wireless technology (e.g., IEEE 802.11 and the like). As described herein, in some embodiments, components of the technology are connected via a local area network (LAN), wireless local area network (WLAN), wide area network (WAN) such as the internet, or any other network type, topology, and/or protocol. In some embodiments, the technology comprises use of a portable device such as a hand-held computer, smartphone, tablet computer, laptop computer, palmtop computer, hiptop computer, e.g., to display results, accept input from a user, provide instructions to another computer, store data, and/or perform other steps of methods provided herein. Some embodiments provide for the use of a thin client terminal to display results, accept input from a user, provide instructions to another computer, store data, and/or perform other steps of methods provided herein.

Some embodiments provide a method for determining a target nucleotide sequence, the method comprising determining n nucleotide subsequences of the target nucleotide sequence (indexed over m), wherein the mth nucleotide subsequence has a 5′ end at nucleotide x_(m) of the target nucleotide sequence and has a 3′ end at nucleotide y_(m) of the target nucleotide sequence; the (m+1)th nucleotide subsequence has a 5′ end at nucleotide x_(m+1) of the target nucleotide sequence and has a 3′ end at nucleotide y_(m+1) of the target nucleotide sequence; and assembling the n nucleotide subsequences to provide a consensus sequence for the target nucleotide sequence, wherein m ranges from 1 to n; x_(m+1)<y_(m); and (y_(m)−x_(m))<100, 90, 80, 70, 60, 55, 50, 45, 40, 35, or 30 or less, (y_(m+1)−x_(m+1))<100, 90, 80, 70, 60, 55, 50, 45, 40, 35, or 30 or less, and (y_(m+1)−y_(m))<20, 10 or less, or less than 5, 4, or 3, or is equal to 1. In some embodiments the fragments are less than 50 bp; accordingly, in some embodiments (y_(m)−x_(m))<50 and (y_(m+1)−x_(m+1))<50. In some embodiments the fragments are less than 40 bp; accordingly in some embodiments (y_(m)−x_(m))<40 and (y_(m+1)−x_(m+1))<40. In some embodiments the fragments are less than 30 bp; accordingly, in some embodiments (y_(m)−x_(m))<30 and (y_(m+1)−x_(m+1))<30.

In some embodiments the 3′ ends of the fragments differ by less than 4 or less than 3 bases with respect to the target nucleic acid sequence. Accordingly, in some embodiments (y_(m+1)−y_(m))<4 or (y_(m+1)−y_(m))<3. In some embodiments the 3′ ends of the fragments differ by 1 base with respect to the target nucleic acid sequence. Thus, in some embodiments (y_(m+1)−y_(m))=1.
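The coordinate relationships among the n subsequences can be checked mechanically, as in the following Python sketch. The function name `check_tiling` and its default thresholds are illustrative choices; in addition, the sketch assumes that successive 3′ ends advance along the target (a positive offset), which is a simplification introduced here and is not stated as a requirement above.

```python
def check_tiling(coords, max_len=100, max_offset=20):
    """Check (x_m, y_m) pairs, ordered by m, against the tiling constraints.

    Enforced: (y_m - x_m) < max_len for every subsequence;
    x_(m+1) < y_m and (y_(m+1) - y_m) < max_offset for adjacent pairs.
    Assumed for this sketch only: (y_(m+1) - y_m) > 0.
    """
    for x, y in coords:
        if not (y - x) < max_len:
            return False
    for (x_m, y_m), (x_next, y_next) in zip(coords, coords[1:]):
        if not x_next < y_m:                      # adjacent reads must overlap
            return False
        if not 0 < (y_next - y_m) < max_offset:   # 3' ends advance by a small step
            return False
    return True


print(check_tiling([(0, 40), (1, 41), (2, 42)]))   # 1-base offsets, overlapping
print(check_tiling([(0, 40), (45, 85)]))           # second read starts past y_1
```

Tightening `max_len` to 50 and `max_offset` to 2 corresponds to the embodiments with fragments shorter than 50 bp and 3′ ends offset by 1 base.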

In some embodiments, determining the n nucleotide subsequences comprises priming from a universal sequence. In some embodiments, determining the n nucleotide subsequences comprises terminating polymerization with a 3′-O-blocked nucleotide analog. In some embodiments determining the first nucleotide subsequence and the second nucleotide subsequence comprises terminating polymerization with a 3′-O-alkynyl nucleotide analog. In some embodiments determining the first nucleotide subsequence and the second nucleotide subsequence comprises terminating polymerization with a 3′-O-propargyl nucleotide analog. In some embodiments determining the first nucleotide subsequence and the second nucleotide subsequence comprises terminating polymerization with a nucleotide analog comprising a reversible terminator.

In some embodiments, methods for generating a next-generation sequencing library are provided. In some embodiments the methods comprise amplifying a target nucleotide sequence using a primer comprising a target-specific sequence, a universal sequence A, and a barcode nucleotide sequence associated with the target nucleic acid to provide an identifiable amplicon; ligating a first adaptor oligonucleotide comprising a universal sequence B to the 3′ end of the amplicon to form an adaptor-amplicon; circularizing the adaptor-amplicon to form a circular template; generating a ladder fragment library from the circular template using a 3′-O-blocked nucleotide analog; and ligating a second adaptor oligonucleotide comprising a universal sequence C to the 3′ ends of the fragments of the ladder fragment library to generate the next-generation sequencing library (e.g., using a ligase or a chemical ligation by, e.g., click chemistry, e.g., a copper catalyzed reaction of an alkyne (e.g., a 3′ alkyne) and an azide (e.g., a 5′ azide)).

In some embodiments, the barcode nucleotide sequence comprises 1 to 20 nucleotides. In some embodiments, the first adaptor oligonucleotide comprises 10 to 80 nucleotides. In some embodiments the nucleotide sequences of the fragments of the ladder fragment library correspond to overlapping nucleotide subsequences within the target nucleotide sequence and the nucleotide sequences of the fragments have 3′ ends corresponding to different nucleotides of the target nucleotide sequence. In some embodiments the nucleotide sequences of the fragments of the ladder fragment library comprise less than 100 nucleotides, e.g., less than 90, 80, 70, 60, 50, or 40 nucleotides, e.g., 15 to 50, e.g., 15 to 40 nucleotides.

In some embodiments the first adaptor oligonucleotide comprises a single-stranded DNA and/or the second adaptor oligonucleotide comprises a single-stranded DNA.

In some embodiments generating a ladder fragment library comprises using an oligonucleotide primer complementary to the universal sequence A.

In some embodiments, the methods further comprise amplifying the next-generation sequencing library.

In some embodiments the 3′-O-blocked nucleotide analog is a 3′-O-alkynyl nucleotide analog, and in some embodiments the 3′-O-alkynyl nucleotide analog is a 3′-O-propargyl nucleotide analog. In some embodiments the nucleotide analog comprises a reversible terminator.

The technology further provides methods for determining a sequence of a nucleic acid. For example, in some embodiments, the method comprises generating a next-generation sequencing library according to the technology provided herein; determining a nucleotide sequence of a fragment of the ladder fragment library, said nucleotide sequence comprising a nucleotide subsequence of the target nucleotide sequence; and determining a barcode nucleotide sequence of the fragment of the ladder fragment library.

In some embodiments, determining the nucleotide sequence of a fragment of the ladder fragment library comprises using an oligonucleotide primer complementary to universal sequence C. In addition, in some embodiments determining the barcode nucleotide sequence of the fragment of the ladder fragment library comprises using an oligonucleotide primer complementary to universal sequence B.

In some embodiments the nucleotide sequence of a fragment of the ladderfragment library comprises less than 100 nucleotides, e.g., 15 to 50nucleotides, e.g., 20 to 50, e.g., 25 to 50, e.g., 30 to 50, e.g., 35 to50, e.g., 40 to 50 nucleotides. In some embodiments the methods furthercomprise associating the barcode nucleotide sequence with a source ofthe target nucleotide sequence.

In some embodiments the methods further comprise collecting or binningnucleotide sequences of fragments of the ladder fragment library havingthe same barcode nucleotide sequence. In some embodiments, the methodsfurther comprise assembling a plurality of nucleotide sequences offragments of the ladder fragment library to provide a consensussequence. In some embodiments the methods further comprise mapping theconsensus sequence to a reference sequence.
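By way of illustration only, the binning and assembly steps described above might be sketched in Python as follows. The function names `bin_by_barcode` and `consensus_from_ladder`, the representation of a read as a (barcode, sequence) pair, and the assumption that each read's start position on the target is already known are hypothetical choices made for this sketch; they are not prescribed by the technology.

```python
from collections import Counter, defaultdict


def bin_by_barcode(reads):
    """Group reads that share a barcode.

    reads: iterable of (barcode, sequence) pairs (hypothetical layout).
    Returns a dict mapping each barcode to the list of its sequences.
    """
    bins = defaultdict(list)
    for barcode, sequence in reads:
        bins[barcode].append(sequence)
    return dict(bins)


def consensus_from_ladder(placed_reads):
    """Build a majority-vote consensus from reads with known offsets.

    placed_reads: iterable of (start, sequence) pairs, where start is the
    0-based position of the first base on the target (an assumption of
    this sketch). Positions not covered by any read are reported as "N".
    """
    columns = defaultdict(Counter)
    for start, sequence in placed_reads:
        for offset, base in enumerate(sequence):
            columns[start + offset][base] += 1
    if not columns:
        return ""
    length = max(columns) + 1
    return "".join(
        columns[pos].most_common(1)[0][0] if pos in columns else "N"
        for pos in range(length)
    )


if __name__ == "__main__":
    reads = [("ACGT", "TTGA"), ("GGCC", "CCAT"), ("ACGT", "TGAC")]
    print(bin_by_barcode(reads))
    print(consensus_from_ladder([(0, "TTGA"), (1, "TGAC"), (2, "GACG")]))
```

A majority vote per position is only one of many possible assembly strategies; it is used here because overlapping ladder fragments cover most positions several times.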

In some embodiments, to provide for reconstruction of a consensus sequence, the technology includes attaching labels to the nucleic acids, such as nucleic acid binding proteins, optical labels, nucleotide analogs, and others known in the art.

The technology provides related compositions comprising a next-generation sequencing library, wherein the next-generation sequencing library comprises a plurality of nucleic acids, each nucleic acid comprising a universal sequence A, a barcode nucleotide sequence, a second universal sequence B, a nucleotide subsequence of a target nucleotide sequence, and a universal sequence C. In some embodiments the compositions comprise n nucleic acids, wherein the mth nucleotide subsequence has a 5′ end at nucleotide x_(m) of the target nucleotide sequence and has a 3′ end at nucleotide y_(m) of the target nucleotide sequence; the (m+1)th nucleotide subsequence has a 5′ end at nucleotide x_(m+1) of the target nucleotide sequence and has a 3′ end at nucleotide y_(m+1) of the target nucleotide sequence; m ranges from 1 to n; x_(m)=x_(m+1); and (y_(m+1)−y_(m)) is less than 20, less than 10, or less than 5, 4, 3, or 2. In some embodiments the 3′ ends of the fragments of the sequencing library are offset with respect to each other and the target nucleotide sequence by fewer than 4 or fewer than 3 bases; accordingly, in some embodiments (y_(m+1)−y_(m))<4 or (y_(m+1)−y_(m))<3. In some embodiments the 3′ ends of the fragments of the sequencing library are offset with respect to each other and the target nucleotide sequence by 1 base; accordingly, in some embodiments (y_(m+1)−y_(m))=1.
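As an illustrative sketch only, the ladder relationship above (a shared 5′ end, with successive 3′ ends stepping by fewer than a chosen number of bases) can be checked in a few lines of Python. The function name `satisfies_ladder`, the `bound` parameter, and the assumption that the coordinate pairs are listed in order of increasing 3′ end are hypothetical and are introduced here only for illustration.

```python
def satisfies_ladder(coords, bound=20):
    """Check the ladder relationship between successive subsequences.

    coords: list of (x, y) pairs giving the 5' end x and 3' end y of each
    subsequence on the target, ordered by increasing y (an assumption of
    this sketch).
    bound: strict upper limit on y_(m+1) - y_(m), e.g., 20, 10, 5, 4, 3, or 2.

    Returns True when every adjacent pair shares its 5' end and the 3' ends
    differ by a positive amount smaller than bound.
    """
    adjacent = zip(coords, coords[1:])
    return all(
        x_m == x_next and 0 < (y_next - y_m) < bound
        for (x_m, y_m), (x_next, y_next) in adjacent
    )


if __name__ == "__main__":
    # Three fragments sharing a 5' end, with 3' ends one base apart.
    print(satisfies_ladder([(1, 15), (1, 16), (1, 17)], bound=2))
```

With `bound=2` the check corresponds to the single-base offset case, (y_(m+1)−y_(m))=1.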

In some embodiments, the universal sequence B comprises 10 to 100 nucleotides and/or the barcode nucleotide sequence comprises 1 to 20 nucleotides.
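For illustration, the layout of a library nucleic acid described above (universal sequence A, barcode, universal sequence B, target subsequence, universal sequence C) can be treated as a simple string and split into its parts. The Python sketch below is a simplification under stated assumptions: the function name `parse_library_molecule` and the example sequences in the usage block are hypothetical, the molecule is modeled as one linear strand read 5′ to 3′, and any non-natural linkage between the fragment and the second adaptor is ignored.

```python
def parse_library_molecule(molecule, univ_a, univ_b, univ_c, max_barcode=20):
    """Split a library molecule into (barcode, target_subsequence).

    Expects the layout: univ_a + barcode + univ_b + subsequence + univ_c,
    with a barcode of 1 to max_barcode nucleotides. Returns None when the
    molecule does not fit that layout.
    """
    if not (molecule.startswith(univ_a) and molecule.endswith(univ_c)):
        return None
    body = molecule[len(univ_a):len(molecule) - len(univ_c)]
    b_start = body.find(univ_b)  # first occurrence of universal sequence B
    if b_start < 1 or b_start > max_barcode:
        return None
    barcode = body[:b_start]
    subsequence = body[b_start + len(univ_b):]
    return barcode, subsequence


if __name__ == "__main__":
    # Hypothetical universal sequences, chosen only for this example.
    A, B, C = "AAAAACCCCC", "GGGGGTTTTT", "CATCATCATC"
    molecule = A + "ACGTAC" + B + "TTGACCGATTGACCA" + C
    print(parse_library_molecule(molecule, A, B, C))
```

Exact string matching is used for brevity; a practical parser would likely tolerate sequencing errors in the universal sequences.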

In some embodiments the compositions further comprise a 3′-O-blocked nucleotide analog such as a 3′-O-alkynyl nucleotide analog, e.g., a 3′-O-propargyl nucleotide analog. In some embodiments the compositions further comprise a sequencing primer. For example, in some embodiments the compositions further comprise a sequencing primer complementary to the universal sequence C and/or a sequencing primer complementary to the universal sequence B.

In some embodiments, the barcode nucleotide sequence is associated with the target nucleotide sequence. In some embodiments the plurality of nucleic acids comprises nucleic acids having different barcode nucleotide sequences and different nucleotide subsequences of a target nucleotide sequence, wherein each barcode nucleotide sequence is associated with the target nucleotide sequence. In some embodiments, the barcode nucleotide sequence is associated in one-to-one correspondence with the target nucleotide sequence.

In some embodiments each nucleic acid of the next-generation sequencing library comprises a 3′-O-blocked nucleotide analog, e.g., a 3′-O-alkynyl nucleotide analog, e.g., a 3′-O-propargyl nucleotide analog. In some embodiments each nucleic acid of the next-generation sequencing library comprises a nucleotide analog comprising a reversible terminator.

Also provided are kits for producing an NGS library and/or for obtaining sequence information from a target nucleic acid. Some embodiments of the technology provide a kit comprising a nucleotide analog, e.g., for producing a nucleotide fragment ladder according to the methods provided herein. In some embodiments, the nucleotide analog is a 3′-O-blocked nucleotide analog, e.g., a 3′-O-alkynyl nucleotide analog, e.g., a 3′-O-propargyl nucleotide analog. In some embodiments, conventional A, C, G, U, and/or T nucleotides are provided in a kit as well as one or more (e.g., 1, 2, 3, or 4) A, C, G, U, and/or T nucleotide analogs.

In some embodiments, kits comprise a polymerase (e.g., a natural polymerase, a modified polymerase, and/or an engineered polymerase, etc.), e.g., for amplification (e.g., by thermal cycling, isothermal amplification) or for sequencing, etc. In some embodiments, kits comprise a ligase, e.g., for attaching adaptors to a nucleic acid such as an amplicon or a ladder fragment or for circularizing an adaptor-amplicon. Some embodiments of kits comprise a copper-based catalyst reagent, e.g., for a click chemistry reaction, e.g., to react an azide and an alkynyl group to form a triazole link. Some kit embodiments provide buffers, salts, reaction vessels, instructions, and/or computer software.

In some embodiments, kits comprise primers and/or adaptors. In some embodiments, the adaptors comprise a chemical modification suitable for attaching the adaptor to the nucleotide analog, e.g., by click chemistry. For example, in some embodiments, the kit comprises a nucleotide analog comprising an alkyne group and an adaptor oligonucleotide comprising an azide (N₃) group. In some embodiments, a “click chemistry” process such as an azide-alkyne cycloaddition is used to link the adaptor to the fragment via formation of a triazole.

Some embodiments of the technology provide systems for obtaining sequence information. For example, system embodiments comprise a nucleotide analog for producing a fragment ladder from a target nucleic acid and a computer readable medium storing instructions for determining the sequence of the target nucleic acid by assembling short sequence reads. In some embodiments, systems comprise one or more adaptor oligonucleotides (e.g., suitable for attachment to the nucleotide analogs) or other kit components as described above.

For example, some system embodiments are associated with assembling (stitching, reconstructing) a nucleic acid sequence. Embodiments of such systems include various components such as, e.g., a nucleic acid sequencer, a sample sequence data storage, a reference sequence data storage, and an analytics computing device/server/node. In some embodiments, the analytics computing device/server/node is a workstation, mainframe computer, personal computer, mobile device, etc. In some embodiments, the systems comprise functionalities for identifying a barcode, parsing sequences based on a barcode, and binning sequences having common barcodes.

In some embodiments, the nucleic acid sequencer is configured to analyze (e.g., interrogate) a nucleic acid fragment (e.g., a single fragment, a mate-pair fragment, a paired-end fragment, etc.) utilizing all available varieties of techniques, platforms, or technologies to obtain nucleic acid sequence information. In some embodiments, the systems comprise functionalities for making base calls, assessing quality scores, aligning sequences, identifying a barcode, parsing sequences based on a barcode, and binning sequences having common barcodes.

In various embodiments, the nucleic acid sequencer communicates with the sample sequence data storage either directly via a data cable (e.g., a serial cable, a direct cable connection, etc.) or a bus linkage or, alternatively, through a network connection (e.g., internet, LAN, WAN, WLAN, VPN, etc.). In various embodiments, the network connection is a hardwired physical connection. For example, some embodiments provide that the nucleic acid sequencer is communicatively connected (via Category 5 (CAT5), fiber optic, or equivalent cabling) to a data server that is, in turn, communicatively connected (via CAT5, fiber optic, or equivalent cabling) through the internet to the sample sequence data storage. In various embodiments, the network connection is a wireless network connection (e.g., Wi-Fi, WLAN, etc.), for example, utilizing an IEEE 802.11 (e.g., a/b/g/n, etc.) or equivalent transmission format. In practice, the network connection utilized is dependent upon the particular requirements of the system. In various embodiments, the sample sequence data storage is an integrated part of the nucleic acid sequencer.

In some embodiments, the sample sequence data storage is a database storage device, system, or implementation (e.g., data storage partition, etc.) that is configured to organize and store nucleic acid sequence read data generated by a nucleic acid sequencer (e.g., the short overlapping sequence reads of less than 300 or less than 200 bases, e.g., ~30-50 bases, and associated index information such as barcode sequence and metadata associated with the barcode such as sample source, type, target nucleic acid, region of interest, experimental conditions, clinical data, etc.) such that the data can be searched (e.g., by barcode sequence or associated metadata) and retrieved manually (e.g., by a database administrator/client operator) or automatically by way of a computer program/application/software script. In various embodiments, the reference data storage can be any database device, storage system, or implementation (e.g., data storage partition, etc.) that is configured to organize and store reference sequences (e.g., whole/partial genome, whole/partial exome, gene, region, chromosome, BAC, etc.) such that the data can be searched and retrieved manually (e.g., by a database administrator/client operator) or automatically by way of a computer program/application/software script. In various embodiments, the sample nucleic acid sequencing read data is stored on the sample sequence data storage and/or the reference data storage in a variety of different data file types/formats, including, but not limited to: *.fasta, *.csfasta, *seq.txt, *qseq.txt, *.fastq, *.sff, *prb.txt, *.sms, *srs and/or *.qv.

In some embodiments, the sample sequence data storage and the reference data storage are independent standalone devices/systems or implemented on different devices. In some embodiments, the sample sequence data storage and the reference data storage are implemented on the same device/system. In some embodiments, the sample sequence data storage and/or the reference data storage are implemented on the analytics computing device/server/node.

In some embodiments, the analytics computing device/server/node is in communication with the sample sequence data storage and the reference data storage either directly via a data cable (e.g., a serial cable, a direct cable connection, etc.) or a bus linkage or, alternatively, through a network connection (e.g., internet, LAN, WAN, VPN, etc.). In various embodiments, the analytics computing device/server/node hosts an assembler, e.g., a reference mapping engine or a de novo mapping module, and/or a tertiary analysis engine.

In some embodiments, the de novo mapping module is configured to assemble sample nucleic acid sequence reads from the sample data storage into new and previously unknown sequences.

In some embodiments, the reference mapping engine is configured to obtain sample nucleic acid sequence reads (e.g., having a common barcode and having been binned together) from the sample data storage and map them against one or more reference sequences obtained from the reference data storage to assemble the reads into a sequence that is similar but not necessarily identical to the reference sequence using all varieties of reference mapping/alignment techniques and methods. The reassembled sequence can then be further analyzed by one or more optional tertiary analysis engines to identify differences in the genetic makeup (genotype, haplotype), gene expression, or epigenetic status of individuals that can result in large differences in physical characteristics (phenotype). For example, in various embodiments, the tertiary analysis engine is configured to identify various genomic variants (in the assembled sequence) due to mutations, recombination/crossover, or genetic drift; to identify phasing of genetic information; to identify phylogenetic and/or taxonomic information; to identify an individual; to identify a species, genus, or other phylogenetic classification; to identify a drug resistance or a drug susceptibility (sensitivity) marker; to identify a gene fusion; to identify a copy number variation; to identify a methylation status; to associate the sequence with a disease state; etc. Examples of types of genomic variants include, but are not limited to: single nucleotide polymorphisms (SNPs), copy number variations (CNVs), insertions/deletions (“indels”), inversions, duplications, translocations, integrations, etc.
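As a minimal illustration of one task a tertiary analysis engine might perform, the Python sketch below lists single-nucleotide differences between an assembled sequence and a reference. It assumes, purely for simplicity, that the two sequences are already aligned without gaps and have equal length; the function name `call_snps` and the use of "N" for uncalled positions are hypothetical. Indels, copy number variations, and the other variant types listed above are outside the scope of this sketch.

```python
def call_snps(reference, consensus):
    """Report single-nucleotide differences between two aligned sequences.

    Both sequences must have the same length (ungapped alignment assumed).
    Returns a list of (position, reference_base, consensus_base) tuples
    with 1-based positions; consensus positions called "N" are skipped.
    """
    if len(reference) != len(consensus):
        raise ValueError("sequences must be pre-aligned to equal length")
    return [
        (index + 1, ref_base, con_base)
        for index, (ref_base, con_base) in enumerate(zip(reference, consensus))
        if ref_base != con_base and con_base != "N"
    ]


if __name__ == "__main__":
    print(call_snps("ACGTAC", "ACCTAC"))
```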

It should be understood, however, that the various engines and modules hosted on the analytics computing device/server/node can be combined or collapsed into a single engine or module, depending on the requirements of the particular application or system architecture. Moreover, in various embodiments, the analytics computing device/server/node hosts additional engines or modules as needed by the particular application or system architecture.

In some embodiments, the mapping and/or tertiary analysis engines are configured to process the nucleic acid and/or reference sequence reads in color space. In various embodiments, the mapping and/or tertiary analysis engines are configured to process the nucleic acid and/or reference sequence reads in base space. It should be understood, however, that the mapping and/or tertiary analysis engines can process or analyze nucleic acid sequence data in any schema or format as long as the schema or format conveys the base identity and position of the nucleic acid sequence.

In some embodiments, the sample nucleic acid sequencing read and reference sequence data are supplied to the analytics computing device/server/node in a variety of different input data file types/formats, including, but not limited to: *.fasta, *.csfasta, *seq.txt, *qseq.txt, *.fastq, *.sff, *prb.txt, *.sms, *srs and/or *.qv.
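As one example of handling such input, a reader for the *.fastq format might look like the following Python sketch. It assumes the common four-line record layout (a header line starting with "@", the sequence, a separator line starting with "+", and a quality string of the same length as the sequence) with no line wrapping; the function name `read_fastq` is hypothetical, and the other listed formats are not covered.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a four-line FASTQ stream.

    handle: a text file object. Multi-line (wrapped) records are not
    supported in this sketch.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record near: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch: " + header)
        yield header[1:], sequence, quality


if __name__ == "__main__":
    example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\n!!!!\n")
    for record in read_fastq(example):
        print(record)
```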

Some embodiments provide a client terminal. The client terminal is, in some embodiments, a thin client or, in some embodiments, a thick client computing device. In some embodiments, the client terminal comprises a web browser (e.g., Internet Explorer, Firefox, Safari, Chrome, etc.) that is used to control the operation of the reference mapping engine, the de novo mapping module, and/or the tertiary analysis engine. That is, the client terminal can access the reference mapping engine, the de novo mapping module, and/or the tertiary analysis engine using a browser to control their functions. For example, the client terminal can be used to configure the operating parameters (e.g., mismatch constraint, quality value thresholds, etc.) of the various engines, depending on the requirements of the particular application. Similarly, the client terminal can also comprise a display to display the results of the analysis performed by the assembler, the reference mapping engine, the de novo mapping module, and/or the tertiary analysis engine.

The technology provided herein, in method, composition, kit, and system embodiments, finds use, e.g., to prepare an NGS library for sequencing, to acquire a nucleotide sequence, to map a single nucleotide polymorphism, to distinguish alleles, to sequence a genome, to identify rare minor population variants (e.g., somatic mutations in cancer or a low-abundance pathogen against a large background of host or non-pathogen DNA), etc.

Sequencing may be by any method known in the art. In certain embodiments, sequencing is sequencing by synthesis. In other embodiments, sequencing is single molecule sequencing by synthesis. In certain embodiments, sequencing involves hybridizing a primer to the template to form a template/primer duplex, contacting the duplex with a polymerase enzyme in the presence of detectably labeled nucleotides under conditions that permit the polymerase to add nucleotides to the primer in a template-dependent manner, detecting a signal from the incorporated labeled nucleotide, and sequentially repeating the contacting and detecting steps at least once, wherein sequential detection of incorporated labeled nucleotides determines the sequence of the nucleic acid. Exemplary detectable labels include radiolabels, fluorescent labels, enzymatic labels, etc. In particular embodiments, the detectable label may be an optically detectable label, such as a fluorescent label. Exemplary fluorescent labels (for sequencing and/or other purposes such as labeling a nucleic acid, primer, probe, etc.) include cyanine, rhodamine, fluorescein, coumarin, BODIPY, Alexa, or conjugated multi-dyes.

Some embodiments provide a method for generating a next-generation sequencing library, the method comprising amplifying a target nucleotide sequence using a primer comprising a target specific sequence, a universal sequence A, and a barcode nucleotide sequence (e.g., comprising 1 to 20 nucleotides) associated with the target nucleic acid to provide an identifiable amplicon; ligating a first adaptor oligonucleotide (e.g., a single-stranded DNA, e.g., comprising 10 to 80 nucleotides) comprising a universal sequence B to the 3′ end of the amplicon to form an adaptor-amplicon; circularizing the adaptor-amplicon to form a circular template; generating from the circular template by use of a primer complementary to the universal sequence A and a 3′-O-blocked nucleotide analog (e.g., a 3′-O-alkynyl nucleotide analog, a 3′-O-propargyl nucleotide analog, or comprising a reversible terminator) a ladder fragment library comprising a plurality of fragments; and ligating (e.g., by click chemistry, e.g., using a copper-based catalyst reagent, e.g., to form a triazole from an azide and an alkynyl) a second adaptor oligonucleotide (e.g., a single-stranded DNA) comprising a universal sequence C to the 3′ ends of the fragments of the ladder fragment library to generate a next-generation sequencing library, wherein the nucleotide sequences of the fragments of the ladder fragment library comprise 15 to 40 nucleotides, the nucleotide sequences of the fragments of the ladder fragment library correspond to overlapping nucleotide subsequences within the target nucleotide sequence, and the nucleotide sequences of the fragments of the ladder fragment library have 3′ ends corresponding to different nucleotides of the target nucleotide sequence.

Some embodiments provide a method for determining a target nucleotide sequence, the method comprising amplifying a target nucleotide sequence using a primer comprising a target specific sequence, a universal sequence A, and a barcode nucleotide sequence (e.g., comprising 1 to 20 nucleotides) associated with the target nucleic acid to provide an amplicon; ligating a first adaptor oligonucleotide (e.g., a single-stranded DNA, e.g., comprising 10 to 80 nucleotides) comprising a universal sequence B to the 3′ end of the amplicon to form an adaptor-amplicon; circularizing the adaptor-amplicon to form a circular template; generating from the circular template by use of a primer complementary to the universal sequence A and a 3′-O-blocked nucleotide analog (e.g., a 3′-O-alkynyl nucleotide analog, a 3′-O-propargyl nucleotide analog, or comprising a reversible terminator) a ladder fragment library comprising a plurality of fragments; ligating (e.g., by click chemistry, e.g., using a copper-based catalyst reagent, e.g., to form a triazole from an azide and an alkynyl) a second adaptor oligonucleotide (e.g., a single-stranded DNA) comprising a universal sequence C to the 3′ ends of the fragments of the ladder fragment library to generate a next-generation sequencing library; determining a nucleotide sequence of a fragment of the ladder fragment library (e.g., using an oligonucleotide primer complementary to universal sequence C), said nucleotide sequence comprising a nucleotide subsequence of the target nucleotide sequence; determining a barcode nucleotide sequence of the fragment of the ladder fragment library (e.g., using an oligonucleotide primer complementary to universal sequence B); associating the barcode nucleotide sequence with a source of the target nucleotide sequence; binning nucleotide sequences of fragments of the ladder fragment library having the same barcode nucleotide sequence; assembling a plurality of nucleotide sequences of fragments of the ladder fragment library to provide a consensus sequence; and mapping the consensus sequence to a reference sequence, wherein the nucleotide sequences of the fragments of the ladder fragment library comprise 15 to 50, 15 to 40, or 15 to 30 nucleotides, the nucleotide sequences of the fragments of the ladder fragment library correspond to overlapping nucleotide subsequences within the target nucleotide sequence, the nucleotide sequences of the fragments of the ladder fragment library have 3′ ends corresponding to different nucleotides of the target nucleotide sequence, and the consensus sequence retains phasing and/or linkage information of the target nucleic acid.

Some embodiments are related to methods, compositions, kits, and systems for sequencing a nucleic acid (e.g., by NGS) by generating a next-generation sequencing library using modified nucleotides, e.g., one or more 3′-O-modified nucleotides such as 3′-O-alkynyl modified nucleotides. In some embodiments, the 3′-O-modified nucleotides are 3′-O-propargyl nucleotides (e.g., 3′-O-propargyl-dNTP, e.g., 3′-O-propargyl-dATP, 3′-O-propargyl-dCTP, 3′-O-propargyl-dGTP, 3′-O-propargyl-dTTP; see, e.g., U.S. patent application Ser. Nos. 14/463,412 and 14/463,416; and Int'l Pat. App. PCT/US2014/051726, each of which is incorporated herein by reference in its entirety for all purposes). For example, embodiments of the technology are related to generating a sequencing library (e.g., for NGS) comprising a nucleic acid fragment ladder produced by incorporating chain-terminating 3′-O-modified nucleotides by a polymerase during the in vitro synthesis of a nucleic acid.

Particular embodiments are related to generating a nucleic acid fragment ladder using a polymerase reaction comprising standard dNTPs and 3′-O-propargyl-dNTPs at a molar ratio of from 1:500 to 500:1 (e.g., a ratio of standard dNTPs to 3′-O-propargyl-dNTPs that is 1:500, 1:450, 1:400, 1:350, 1:300, 1:250, 1:200, 1:150, 1:100, 1:90, 1:80, 1:70, 1:60, 1:50, 1:40, 1:30, 1:20, 1:10, 1:9, 1:8, 1:7, 1:6, 1:5, 1:4, 1:3, 1:2, 2:1, 3:1, 4:1, 5:1, 6:1, 7:1, 8:1, 9:1, 10:1, 20:1, 30:1, 40:1, 50:1, 60:1, 70:1, 80:1, 90:1, 100:1, 150:1, 200:1, 250:1, 300:1, 350:1, 400:1, 450:1, or 500:1). Terminated nucleic acid fragments produced by methods described herein comprise a propargyl group on their 3′ ends. Further embodiments are related to attaching an adaptor to the 3′ ends of the nucleic acid fragments using chemical conjugation. For example, in some embodiments a 5′-azido-modified oligonucleotide (e.g., a 5′-azido-methyl-modified oligonucleotide) is conjugated to the 3′-propargyl-terminated nucleic acid fragments by click chemistry (e.g., in a reaction catalyzed by a copper (e.g., copper (I)) reagent). In some embodiments, a target region is first amplified (e.g., by PCR) to produce a target amplicon for sequencing. In some embodiments, amplifying the target region comprises amplification of the target region for 5 to 15 cycles (e.g., a “low-cycle” amplification).
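To illustrate how the molar ratio of standard dNTPs to 3′-O-propargyl-dNTPs might shape the fragment ladder, the following Python sketch models chain termination under a strong simplifying assumption that is not stated in this disclosure: that the polymerase incorporates a standard dNTP and its 3′-O-propargyl analog with equal efficiency, so that the chance of termination at each position equals the terminator's mole fraction. Real polymerases may discriminate for or against the analog, so the numbers are indicative only. The function names are hypothetical.

```python
from fractions import Fraction


def termination_probability(standard_parts, terminator_parts):
    """Per-position termination probability for a standard:terminator ratio.

    Assumes equal incorporation efficiency for both nucleotide types
    (a simplification made for this sketch).
    """
    return Fraction(terminator_parts, standard_parts + terminator_parts)


def extension_length_distribution(p, max_length):
    """Probability that extension stops after exactly k incorporations.

    Under the model above the extension length is geometric:
    P(k) = (1 - p)**(k - 1) * p, for k = 1 .. max_length.
    """
    return [(1 - p) ** (k - 1) * p for k in range(1, max_length + 1)]


def mean_extension_length(p):
    """Mean of the geometric distribution, 1 / p."""
    return 1 / p


if __name__ == "__main__":
    p = termination_probability(20, 1)  # a 20:1 ratio of standard to terminator
    print(p, mean_extension_length(p))
    print([float(x) for x in extension_length_distribution(p, 5)])
```

Under this model, a higher proportion of terminator shortens the average extension, which is one way to reason about choosing a ratio for a desired fragment length range.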

Further embodiments provide that the target amplicon comprises a tag (e.g., comprises a barcode sequence), e.g., the target amplicon is an identifiable amplicon. In some embodiments, a primer used in the amplification of the target region comprises a tag (e.g., comprising a barcode sequence) that is subsequently incorporated into the target amplicon (e.g., in a “copy and tag” reaction) to produce an identifiable amplicon. In some embodiments, an adaptor comprising the tag (e.g., comprising a barcode sequence) is ligated to the target amplicon after amplification (e.g., in a ligase reaction) to produce an identifiable adaptor-amplicon. In some embodiments, the primer used to produce an identifiable amplicon in a copy and tag reaction comprises a 3′ region comprising a target-specific priming sequence and a 5′ region comprising two different universal sequences (e.g., a universal sequence A and a universal sequence B) flanking a degenerate sequence. In some embodiments, an adaptor ligated to an amplicon to produce an identifiable adaptor-amplicon is a double stranded adaptor, e.g., comprising one strand comprising a degenerate sequence (e.g., comprising 8 to 12 bases) flanked on both the 5′ end and the 3′ end by two different universal sequences (e.g., a universal sequence A and a universal sequence B) and a second strand comprising a universal sequence C (e.g., at the 5′ end) and a sequence (e.g., at the 3′ end) that is complementary to the universal sequence B and that has an additional T at the 3′-terminal position.
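As a rough guide to how many distinct tags a degenerate sequence of the lengths mentioned above can supply, the Python sketch below counts the possible sequences and estimates the chance that at least two tagged molecules receive the same tag. It assumes, for illustration only, that every degenerate position is drawn independently and uniformly from A, C, G, and T; the function names are hypothetical.

```python
def barcode_space(num_bases):
    """Number of distinct sequences of the given length over A, C, G, T."""
    return 4 ** num_bases


def collision_probability(num_molecules, num_bases):
    """Probability that at least two molecules share a degenerate tag.

    Exact birthday-problem calculation under the uniform-draw assumption.
    """
    space = barcode_space(num_bases)
    if num_molecules > space:
        return 1.0
    p_all_unique = 1.0
    for already_used in range(num_molecules):
        p_all_unique *= (space - already_used) / space
    return 1.0 - p_all_unique


if __name__ == "__main__":
    for n in (8, 10, 12):
        print(n, barcode_space(n), collision_probability(1000, n))
```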

Then, embodiments of the technology provide for the generation of nucleic acid ladder fragments from the adaptor-amplicon, e.g., to provide a sequencing library for NGS. In particular, the technology provides for the generation of a 3′-O-propargyl-dN terminated nucleic acid ladder for nucleic acid sequencing (e.g., NGS), e.g., by using a polymerase reaction comprising standard dNTPs and 3′-O-propargyl-dNTPs at a molar ratio of from 1:500 to 500:1 (standard dNTPs to 3′-O-propargyl-dNTPs). Then, in some embodiments, the technology provides for attaching an adaptor to the 3′ ends of the nucleic acid fragments using chemical conjugation. For example, in some embodiments, a 5′-azido-modified oligonucleotide (e.g., a 5′-azido-methyl-modified oligonucleotide) is conjugated to the 3′-propargyl-terminated nucleic acid fragments by click chemistry (e.g., in a reaction catalyzed by a copper (e.g., copper (I)) reagent).

Accordingly, some embodiments provide a method for generating a next-generation sequencing library, the method comprising amplifying a target nucleotide sequence using a primer comprising a target specific sequence, a universal sequence A, a universal sequence B, and a barcode nucleotide sequence (e.g., comprising 1 to 20 nucleotides) associated with the target nucleic acid to provide an identifiable amplicon; generating a nucleic acid fragment ladder from the identifiable amplicon using a 3′-O-blocked nucleotide analog (e.g., a 3′-O-alkynyl nucleotide analog, a 3′-O-propargyl nucleotide analog); and ligating (e.g., by click chemistry, e.g., using a copper-based catalyst reagent, e.g., to form a triazole from an azide and an alkynyl) a second adaptor oligonucleotide (e.g., a single-stranded DNA) comprising a universal sequence C to the 3′ ends of the fragments of the ladder fragment library to generate a next-generation sequencing library, wherein the nucleotide sequences of the fragments of the ladder fragment library comprise 15 to 100 nucleotides, the nucleotide sequences of the fragments of the ladder fragment library correspond to overlapping nucleotide subsequences within the target nucleotide sequence, and the nucleotide sequences of the fragments of the ladder fragment library have 3′ ends corresponding to different nucleotides of the target nucleotide sequence.

Some embodiments provide a method for generating a next-generation sequencing library, the method comprising amplifying a target nucleotide sequence to provide an amplicon; ligating an adaptor (e.g., an adaptor comprising one strand comprising a degenerate sequence (e.g., comprising 8 to 12 bases) flanked on both the 5′ end and the 3′ end by two different universal sequences (e.g., a universal sequence A and a universal sequence B) and a second strand comprising a universal sequence C (e.g., at the 5′ end) and a sequence (e.g., at the 3′ end) that is complementary to the universal sequence B and that has an additional T at the 3′-terminal position) to the amplicon to produce an adaptor-amplicon; generating a nucleic acid fragment ladder from the adaptor-amplicon using a 3′-O-blocked nucleotide analog (e.g., a 3′-O-alkynyl nucleotide analog, a 3′-O-propargyl nucleotide analog); and ligating (e.g., by click chemistry, e.g., using a copper-based catalyst reagent, e.g., to form a triazole from an azide and an alkynyl) a second adaptor oligonucleotide (e.g., a single-stranded DNA) comprising a universal sequence C to the 3′ ends of the fragments of the ladder fragment library to generate a next-generation sequencing library, wherein the nucleotide sequences of the fragments of the ladder fragment library comprise 15 to 100 nucleotides, the nucleotide sequences of the fragments of the ladder fragment library correspond to overlapping nucleotide subsequences within the target nucleotide sequence, and the nucleotide sequences of the fragments of the ladder fragment library have 3′ ends corresponding to different nucleotides of the target nucleotide sequence.

Some embodiments provide a method for determining a target nucleotide sequence, the method comprising amplifying a target nucleotide sequence using a primer comprising a target specific sequence, a universal sequence A, a universal sequence B, and a barcode nucleotide sequence (e.g., comprising 1 to 20 nucleotides) associated with the target nucleic acid to provide an identifiable amplicon; generating a nucleic acid fragment ladder from the identifiable amplicon using a 3′-O-blocked nucleotide analog (e.g., a 3′-O-alkynyl nucleotide analog, a 3′-O-propargyl nucleotide analog); ligating (e.g., by click chemistry, e.g., using a copper-based catalyst reagent, e.g., to form a triazole from an azide and an alkynyl) a second adaptor oligonucleotide (e.g., a single-stranded DNA) comprising a universal sequence C to the 3′ ends of the fragments of the ladder fragment library to generate a next-generation sequencing library; determining a nucleotide sequence of a fragment of the ladder fragment library (e.g., using an oligonucleotide primer complementary to universal sequence C), said nucleotide sequence comprising a nucleotide subsequence of the target nucleotide sequence; determining a barcode nucleotide sequence of the fragment of the ladder fragment library; associating the barcode nucleotide sequence with a source of the target nucleotide sequence; binning nucleotide sequences of fragments of the ladder fragment library having the same barcode nucleotide sequence; assembling a plurality of nucleotide sequences of fragments of the ladder fragment library to provide a consensus sequence; and, in some embodiments, mapping the consensus sequence to a reference sequence, wherein the nucleotide sequences of the fragments of the ladder fragment library comprise 15 to 50, 15 to 40, or 15 to 30 nucleotides, the nucleotide sequences of the fragments of the ladder fragment library correspond to overlapping nucleotide subsequences within the target nucleotide sequence, the nucleotide sequences of the fragments of the ladder fragment library have 3′ ends corresponding to different nucleotides of the target nucleotide sequence, and the consensus sequence retains phasing and/or linkage information of the target nucleic acid.

Some embodiments provide a method for determining a target nucleotide sequence, the method comprising amplifying a target nucleotide sequence to provide an amplicon; ligating an adaptor (e.g., an adaptor comprising one strand comprising a degenerate sequence (e.g., comprising 8 to 12 bases) flanked on both the 5′ end and the 3′ end by two different universal sequences (e.g., a universal sequence A and a universal sequence B) and a second strand comprising a universal sequence C (e.g., at the 5′ end) and a sequence (e.g., at the 3′ end) that is complementary to the universal sequence B and that has an additional T at the 3′-terminal position) to the amplicon to produce an adaptor-amplicon; generating a nucleic acid fragment ladder from the adaptor-amplicon using a 3′-O-blocked nucleotide analog (e.g., a 3′-O-alkynyl nucleotide analog, a 3′-O-propargyl nucleotide analog); ligating (e.g., by click chemistry, e.g., using a copper-based catalyst reagent, e.g., to form a triazole from an azide and an alkynyl) a second adaptor oligonucleotide (e.g., a single-stranded DNA) comprising a universal sequence C to the 3′ ends of the fragments of the ladder fragment library to generate a next-generation sequencing library; determining a nucleotide sequence of a fragment of the ladder fragment library (e.g., using an oligonucleotide primer complementary to universal sequence C), said nucleotide sequence comprising a nucleotide subsequence of the target nucleotide sequence; determining a barcode nucleotide sequence of the fragment of the ladder fragment library; associating the barcode nucleotide sequence with a source of the target nucleotide sequence; binning nucleotide sequences of fragments of the ladder fragment library having the same barcode nucleotide sequence; assembling a plurality of nucleotide sequences of fragments of the ladder fragment library to provide a consensus sequence; and, in some embodiments, mapping the consensus sequence to a reference sequence, wherein the nucleotide sequences of the fragments of the ladder fragment library comprise 15 to 50, 15 to 40, or 15 to 30 nucleotides, the nucleotide sequences of the fragments of the ladder fragment library correspond to overlapping nucleotide subsequences within the target nucleotide sequence, the nucleotide sequences of the fragments of the ladder fragment library have 3′ ends corresponding to different nucleotides of the target nucleotide sequence, and the consensus sequence retains phasing and/or linkage information of the target nucleic acid.

Some embodiments provide a method for determining a target nucleotide sequence, the method comprising determining a first nucleotide subsequence of the target nucleotide sequence (e.g., by priming from a universal sequence and, e.g., terminating polymerization with a 3′-O-blocked nucleotide analog such as a 3′-O-alkynyl nucleotide analog or a 3′-O-propargyl nucleotide analog or terminating polymerization with a nucleotide analog comprising a reversible terminator), said first nucleotide subsequence having a 5′ end at nucleotide x1 of the target nucleotide sequence and having a 3′ end at nucleotide y1 of the target nucleotide sequence; determining a second nucleotide subsequence of the target nucleotide sequence (e.g., by priming from a universal sequence and, e.g., terminating polymerization with a 3′-O-blocked nucleotide analog such as a 3′-O-alkynyl nucleotide analog or a 3′-O-propargyl nucleotide analog or terminating polymerization with a nucleotide analog comprising a reversible terminator), said second nucleotide subsequence having a 5′ end at nucleotide x2 of the target nucleotide sequence and having a 3′ end at nucleotide y2 of the target nucleotide sequence; assembling the first nucleotide subsequence and the second nucleotide subsequence to provide a consensus sequence (e.g., comprising 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, or more than 1000, e.g., 2000, 2500, 3000, 3500, 4000, 4500, or 5000, or more than 5000 bases) for the target nucleotide sequence; identifying a source or sample of the target nucleotide sequence by decoding a barcode nucleotide sequence; and mapping the consensus sequence (e.g., retaining phasing and/or linkage information of the target nucleic acid) to a reference sequence, wherein x2<y1; and (y1−x1)<100 (e.g., (y1−x1)<90, 80, 70, 60, 55, 50, 45, 40, 35, or 30), (y2−x2)<100 (e.g., (y2−x2)<90, 80, 70, 60, 55, 50, 45, 40, 35, or 30), and (y2−y1)<20 (e.g., (y2−y1)<10, (y2−y1)<5, (y2−y1)<4, (y2−y1)<3, (y2−y1)<2, or (y2−y1)=1).

Some embodiments provide a method for determining a target nucleotide sequence, the method comprising determining n nucleotide subsequences of the target nucleotide sequence (e.g., by priming from a universal sequence and, e.g., terminating polymerization with a 3′-O-blocked nucleotide analog such as a 3′-O-alkynyl nucleotide analog or a 3′-O-propargyl nucleotide analog or terminating polymerization with a nucleotide analog comprising a reversible terminator), wherein the mth nucleotide subsequence has a 5′ end at nucleotide x_(m) of the target nucleotide sequence and has a 3′ end at nucleotide y_(m) of the target nucleotide sequence; and the (m+1)th nucleotide subsequence has a 5′ end at nucleotide x_(m+1) of the target nucleotide sequence and has a 3′ end at nucleotide y_(m+1) of the target nucleotide sequence; assembling the n nucleotide subsequences to provide a consensus sequence (e.g., comprising 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, or more than 1000 bases, e.g., 2000, 2500, 3000, 3500, 4000, 4500, or 5000 or more than 5000 bases) for the target nucleotide sequence; identifying a source or sample of the target nucleotide sequence by decoding a barcode nucleotide sequence; and mapping the consensus sequence to a reference sequence, wherein: m ranges from 1 to n; x_(m+1)<y_(m); and (y_(m)−x_(m))<100 (e.g., (y_(m)−x_(m))<90, 80, 70, 60, 55, 50, 45, 40, 35, or 30), (y_(m+1)−x_(m+1))<100 (e.g., (y_(m+1)−x_(m+1))<90, 80, 70, 60, 55, 50, 45, 40, 35, or 30), and (y_(m+1)−y_(m))<20 (e.g., (y_(m+1)−y_(m))<10, (y_(m+1)−y_(m))<5, (y_(m+1)−y_(m))<4, (y_(m+1)−y_(m))<3, or (y_(m+1)−y_(m))=1), and the consensus sequence retains phasing and/or linkage information of the target nucleic acid.
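By way of illustration only, the coordinate constraints recited above can be checked with a short routine. The following Python sketch is not part of the described methods; the function name, the tuple representation of (x_(m), y_(m)), and the default limits are assumptions chosen for the example. It further assumes that successive 3′ ends advance along the target (y_(m+1) > y_(m)), which is consistent with, but not expressly required by, the inequalities above.

```python
from typing import List, Tuple


def is_valid_ladder(spans: List[Tuple[int, int]],
                    max_len: int = 100,
                    max_step: int = 20) -> bool:
    """Check (x_m, y_m) spans, listed in order of m, against the constraints.

    max_len  : upper bound on (y_m - x_m), e.g. 100
    max_step : upper bound on (y_(m+1) - y_m), e.g. 20
    """
    # Every subsequence must be shorter than max_len.
    for x, y in spans:
        if not (0 <= y - x < max_len):
            return False
    # Consecutive subsequences must overlap and their 3' ends must be close.
    for (x_m, y_m), (x_next, y_next) in zip(spans, spans[1:]):
        if not x_next < y_m:                    # x_(m+1) < y_m
            return False
        if not 0 < y_next - y_m < max_step:     # assumed: 3' ends advance
            return False
    return True


# Three 30-base spans whose 3' ends are offset by one base.
print(is_valid_ladder([(1, 31), (2, 32), (3, 33)]))
# Two spans that do not overlap fail the x_(m+1) < y_m condition.
print(is_valid_ladder([(1, 31), (40, 45)]))
```

Tightening `max_len` and `max_step` (e.g., to 50 and 5) corresponds to the narrower ranges listed parenthetically above.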

Some embodiments of the technology provide a composition for use as a next-generation sequencing library to obtain a sequence of a target nucleic acid, the composition comprising a 3′-O-blocked nucleotide analog, a 3′-O-alkynyl nucleotide analog, a 3′-O-propargyl nucleotide analog, or a nucleotide analog comprising a reversible terminator; a sequencing primer (e.g., complementary to a universal sequence C); a second sequencing primer (e.g., complementary to a universal sequence B); and n nucleic acids comprising a 3′-O-blocked nucleotide analog, a 3′-O-alkynyl nucleotide analog, or a 3′-O-propargyl nucleotide analog linked (e.g., by a triazole link formed, e.g., by click chemistry, e.g., by a reaction between an azide and an alkyne catalyzed by a copper-based catalyst) to an adaptor (e.g., a next-generation sequencing adaptor oligonucleotide), or a nucleotide analog comprising a reversible terminator, wherein each nucleic acid comprises a nucleotide subsequence of the target nucleic acid, a universal sequence B comprising 10 to 100 nucleotides, a universal sequence C comprising 10 to 100 nucleotides, and/or a barcode nucleotide sequence comprising 1 to 20 nucleotides, wherein the mth nucleotide subsequence has a 5′ end at nucleotide x_(m) of the target nucleotide sequence and has a 3′ end at nucleotide y_(m) of the target nucleotide sequence; the (m+1)th nucleotide subsequence has a 5′ end at nucleotide x_(m+1) of the target nucleotide sequence and has a 3′ end at nucleotide y_(m+1) of the target nucleotide sequence; m ranges from 1 to n; x_(m)=x_(m+1); (y_(m+1)−y_(m))<20 (e.g., (y_(m+1)−y_(m))<15, (y_(m+1)−y_(m))<10, (y_(m+1)−y_(m))<5, (y_(m+1)−y_(m))<4, (y_(m+1)−y_(m))<3, or (y_(m+1)−y_(m))=1); the n nucleic acids comprise nucleic acids having different barcode nucleotide sequences and different nucleotide subsequences of a target nucleotide sequence, wherein each barcode nucleotide sequence is associated (e.g., with one-to-one correspondence) with a target nucleotide sequence.

Some embodiments of the technology provide a composition for use as a next-generation sequencing library to obtain a sequence of a target nucleic acid, the composition comprising n nucleic acids (e.g., a nucleic acid fragment library), wherein each of the n nucleic acids comprises a 3′-O-blocked nucleotide analog (e.g., a 3′-O-alkynyl nucleotide analog such as a 3′-O-propargyl nucleotide analog). In some embodiments, each nucleic acid of the n nucleic acids comprises a nucleotide subsequence of a target nucleotide sequence. In particular, embodiments provide a composition comprising n nucleic acids, wherein each of the n nucleic acids is terminated by a 3′-O-blocked nucleotide analog (e.g., a 3′-O-alkynyl nucleotide analog such as a 3′-O-propargyl nucleotide analog). Further embodiments provide a composition comprising n nucleic acids (e.g., a nucleic acid fragment library), wherein each of the n nucleic acids comprises a 3′-O-blocked nucleotide analog (e.g., a 3′-O-alkynyl nucleotide analog such as a 3′-O-propargyl nucleotide analog) and each of the n nucleic acids is conjugated (e.g., linked) to an oligonucleotide adaptor by a triazole linkage (e.g., a linkage formed from a chemical conjugation of a propargyl group and an azido group, e.g., by a click chemistry reaction). For example, some embodiments provide a composition comprising n nucleic acids (e.g., a nucleic acid fragment library), wherein each of the n nucleic acids comprises a 3′-O-propargyl nucleotide analog (e.g., a 3′-O-propargyl-dA, 3′-O-propargyl-dC, 3′-O-propargyl-dG, and/or a 3′-O-propargyl-dT) conjugated (e.g., linked) to an oligonucleotide adaptor by a triazole linkage (e.g., a linkage formed from a chemical conjugation of a propargyl group and an azido group, e.g., by a click chemistry reaction).

In some embodiments, the composition for use as a next-generation sequencing library to obtain a sequence of a target nucleic acid is produced by a method comprising synthesizing n nucleic acids (e.g., a nucleic acid fragment library) using a mixture of dNTPs and one or more 3′-O-blocked nucleotide analog(s) (e.g., one or more 3′-O-alkynyl nucleotide analog(s) such as one or more 3′-O-propargyl nucleotide analog(s)), e.g., at a molar ratio of from 1:500 to 500:1 (e.g., 1:500, 1:450, 1:400, 1:350, 1:300, 1:250, 1:200, 1:150, 1:100, 1:90, 1:80, 1:70, 1:60, 1:50, 1:40, 1:30, 1:20, 1:10, 1:9, 1:8, 1:7, 1:6, 1:5, 1:4, 1:3, 1:2, 2:1, 3:1, 4:1, 5:1, 6:1, 7:1, 8:1, 9:1, 10:1, 20:1, 30:1, 40:1, 50:1, 60:1, 70:1, 80:1, 90:1, 100:1, 150:1, 200:1, 250:1, 300:1, 350:1, 400:1, 450:1, or 500:1). In some embodiments, the composition is produced using a polymerase obtained from, derived from, isolated from, cloned from, etc. a Thermococcus species (e.g., an organism of the taxonomic lineage Archaea; Euryarchaeota; Thermococci; Thermococcales; Thermococcaceae; Thermococcus). In some embodiments, the polymerase is obtained from, derived from, isolated from, cloned from, etc. a Thermococcus species 9° N-7. In some embodiments, the polymerase comprises amino acid substitutions that provide for improved incorporation of modified substrates such as modified dideoxynucleotides, ribonucleotides, and acyclonucleotides. In some embodiments, the polymerase comprises amino acid substitutions that provide for improved incorporation of nucleotide analogs comprising modified 3′ functional groups such as the 3′-O-propargyl dNTPs described herein. In some embodiments the amino acid sequence of the polymerase comprises one or more amino acid substitutions relative to the Thermococcus sp. 9° N-7 wild-type polymerase amino acid sequence, e.g., a substitution of alanine for the aspartic acid at amino acid position 141 (D141A), a substitution of alanine for the glutamic acid at amino acid position 143 (E143A), a substitution of valine for the tyrosine at amino acid position 409 (Y409V), and/or a substitution of leucine for the alanine at amino acid position 485 (A485L). In some embodiments, the polymerase is provided in a heterologous host organism such as Escherichia coli that comprises a cloned Thermococcus sp. 9° N-7 polymerase gene, e.g., comprising one or more mutations (e.g., D141A, E143A, Y409V, and/or A485L). In some embodiments, the polymerase is a Thermococcus sp. 9° N-7 polymerase sold under the trade name THERMINATOR (e.g., THERMINATOR II) by New England BioLabs (Ipswich, Mass.).

Accordingly, the technology relates to reaction mixtures comprising a target nucleic acid, a mixture of dNTPs and one or more 3′-O-blocked nucleotide analog(s) (e.g., one or more 3′-O-alkynyl nucleotide analog(s) such as one or more 3′-O-propargyl nucleotide analog(s)), e.g., at a molar ratio of from 1:500 to 500:1 (e.g., 1:500, 1:450, 1:400, 1:350, 1:300, 1:250, 1:200, 1:150, 1:100, 1:90, 1:80, 1:70, 1:60, 1:50, 1:40, 1:30, 1:20, 1:10, 1:9, 1:8, 1:7, 1:6, 1:5, 1:4, 1:3, 1:2, 2:1, 3:1, 4:1, 5:1, 6:1, 7:1, 8:1, 9:1, 10:1, 20:1, 30:1, 40:1, 50:1, 60:1, 70:1, 80:1, 90:1, 100:1, 150:1, 200:1, 250:1, 300:1, 350:1, 400:1, 450:1, or 500:1), and a polymerase for synthesizing a nucleic acid using the dNTPs and one or more 3′-O-blocked nucleotide analog(s) (e.g., a polymerase obtained from, derived from, isolated from, cloned from, etc. a Thermococcus species). In some embodiments, the target nucleic acid is an amplicon. In some embodiments, the target nucleic acid comprises a barcode. In some embodiments, the target nucleic acid is an amplicon comprising a barcode. In some embodiments, the target nucleic acid is an amplicon ligated to an adaptor comprising a barcode. Some embodiments provide reaction mixtures that comprise a plurality of target nucleic acids, each target nucleic acid comprising a barcode associated with an identifiable characteristic of the target nucleic acid.
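As a rough guide to how the molar ratio of dNTPs to 3′-O-blocked analogs may influence the fragment ladder, the following Python sketch uses a simplified model that is an assumption made here for illustration and is not taken from the present description: at each position the polymerase incorporates either a natural dNTP or a terminating analog, independently of sequence context, with the analog weighted by a hypothetical `relative_efficiency` factor. Under that assumption the extension length follows a geometric distribution. Actual incorporation efficiencies depend on the polymerase and analog and would need to be determined empirically.

```python
def termination_probability(dntp_parts: float,
                            analog_parts: float,
                            relative_efficiency: float = 1.0) -> float:
    """Per-position probability of incorporating a terminating analog.

    dntp_parts : analog_parts is the molar ratio of the mixture (e.g., 100:1).
    relative_efficiency is a hypothetical weighting of analog incorporation
    relative to the natural dNTP (1.0 = incorporated equally well).
    """
    weighted = analog_parts * relative_efficiency
    return weighted / (dntp_parts + weighted)


def expected_extension_length(dntp_parts: float,
                              analog_parts: float,
                              relative_efficiency: float = 1.0) -> float:
    """Mean number of bases added before termination (geometric model)."""
    return 1.0 / termination_probability(dntp_parts, analog_parts,
                                         relative_efficiency)


def fraction_terminated_within(length: int,
                               dntp_parts: float,
                               analog_parts: float,
                               relative_efficiency: float = 1.0) -> float:
    """Fraction of extension products terminated within `length` bases."""
    p = termination_probability(dntp_parts, analog_parts, relative_efficiency)
    return 1.0 - (1.0 - p) ** length


# Example: a 100:1 dNTP:analog mixture under the equal-efficiency assumption.
print(expected_extension_length(100, 1))
print(fraction_terminated_within(1000, 100, 1))
```

In this simplified picture, a higher proportion of analog shifts the ladder toward shorter fragments, and a lower proportion shifts it toward longer fragments.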

Some embodiments provide a reaction mixture composition comprising a template (e.g., a circular template, e.g., comprising a universal nucleotide sequence and/or a barcode nucleotide sequence) comprising a subsequence of a target nucleic acid, a polymerase, one or more fragments of a ladder fragment library, and a 3′-O-blocked nucleotide analog. Some embodiments provide a reaction mixture composition comprising a library of nucleic acids, the library of nucleic acids comprising overlapping short nucleotide sequences tiled over a target nucleic acid (e.g., the overlapping short nucleotide sequences cover a region of the target nucleic acid comprising 100 bases, 200 bases, 300 bases, 400 bases, 500 bases, 600 bases, 700 bases, 800 bases, 900 bases, 1000 bases, or more than 1000 bases, e.g., 2000 bases, 2500 bases, 3000 bases, 3500 bases, 4000 bases, 4500 bases, 5000 bases, or more than 5000 bases) and offset from one another by 1-20, 1-10, or 1-5 bases (e.g., 1 base) and each nucleic acid of the library comprising less than 100 bases, less than 90 bases, less than 80 bases, less than 70 bases, less than 60 bases, less than 50 bases, less than 45 bases, less than 40 bases, less than 35 bases, or less than 30 bases.

Some embodiments provide a kit for generating a sequencing library, the kit comprising an adaptor oligonucleotide comprising a first reactive group (e.g., an azide), a 3′-O-blocked nucleotide analog (e.g., a 3′-O-alkynyl nucleotide analog or a 3′-O-propargyl nucleotide analog, e.g., comprising an alkyne group, e.g., comprising a second reactive group that forms a chemical bond with the first reactive group, e.g., using click chemistry), a polymerase (e.g., a polymerase for isothermal amplification or thermal cycling), a second adaptor oligonucleotide, one or more compositions comprising a nucleotide or a mixture of nucleotides, and a ligase or a copper-based click chemistry catalyst reagent.

In some embodiments of kits, kits comprise one or more 3′-O-blocked nucleotide analog(s) (e.g., one or more 3′-O-alkynyl nucleotide analog(s) such as one or more 3′-O-propargyl nucleotide analog(s)) and one or more adaptor oligonucleotides comprising an azide group (e.g., a 5′-azido oligonucleotide, e.g., a 5′-azido-methyl oligonucleotide). Some kit embodiments further provide a 5′-azido-methyl oligonucleotide comprising a barcode. Some kit embodiments further provide a plurality of 5′-azido-methyl oligonucleotides comprising a plurality of barcodes (e.g., each 5′-azido-methyl oligonucleotide comprises a barcode that is distinguishable from one or more other barcodes of one or more other 5′-azido-methyl oligonucleotide(s) comprising a different barcode). Further kit embodiments comprise a click chemistry catalytic reagent (e.g., a copper(I) catalytic reagent).

Some kit embodiments comprise one or more standard dNTPs in addition to the one or more 3′-O-blocked nucleotide analog(s) (e.g., one or more 3′-O-alkynyl nucleotide analog(s) such as one or more 3′-O-propargyl nucleotide analog(s)). For instance, some kit embodiments provide dATP, dCTP, dGTP, and dTTP, either in separate vessels or as a mixture with one or more of 3′-O-propargyl-dATP, 3′-O-propargyl-dCTP, 3′-O-propargyl-dGTP, and/or 3′-O-propargyl-dTTP.

Some kit embodiments further comprise a polymerase obtained from, derived from, isolated from, cloned from, etc. a Thermococcus species (e.g., an organism of the taxonomic lineage Archaea; Euryarchaeota; Thermococci; Thermococcales; Thermococcaceae; Thermococcus). In some embodiments, the polymerase is obtained from, derived from, isolated from, cloned from, etc. a Thermococcus species 9° N-7. In some embodiments, the polymerase comprises amino acid substitutions that provide for improved incorporation of modified substrates such as modified dideoxynucleotides, ribonucleotides, and acyclonucleotides. In some embodiments, the polymerase comprises amino acid substitutions that provide for improved incorporation of nucleotide analogs comprising modified 3′ functional groups such as the 3′-O-propargyl dNTPs described herein. In some embodiments the amino acid sequence of the polymerase comprises one or more amino acid substitutions relative to the Thermococcus sp. 9° N-7 wild-type polymerase amino acid sequence, e.g., a substitution of alanine for the aspartic acid at amino acid position 141 (D141A), a substitution of alanine for the glutamic acid at amino acid position 143 (E143A), a substitution of valine for the tyrosine at amino acid position 409 (Y409V), and/or a substitution of leucine for the alanine at amino acid position 485 (A485L). In some embodiments, the polymerase is provided in a heterologous host organism such as Escherichia coli that comprises a cloned Thermococcus sp. 9° N-7 polymerase gene, e.g., comprising one or more mutations (e.g., D141A, E143A, Y409V, and/or A485L). In some embodiments, the polymerase is a Thermococcus sp. 9° N-7 polymerase sold under the trade name THERMINATOR (e.g., THERMINATOR II) by New England BioLabs (Ipswich, Mass.).

Accordingly, some kit embodiments comprise one or more 3′-O-propargyl nucleotide analog(s) (e.g., one or more of 3′-O-propargyl-dATP, 3′-O-propargyl-dCTP, 3′-O-propargyl-dGTP, and/or 3′-O-propargyl-dTTP), a mixture of standard dNTPs (e.g., dATP, dCTP, dGTP, and dTTP), one or more 5′-azido-methyl oligonucleotide adaptors, a polymerase obtained from, derived from, isolated from, cloned from, etc. a Thermococcus species, and a click chemistry catalyst for forming a triazole from an azide group and an alkyne group. In some embodiments, the one or more 3′-O-propargyl nucleotide analog(s) (e.g., one or more of 3′-O-propargyl-dATP, 3′-O-propargyl-dCTP, 3′-O-propargyl-dGTP, and/or 3′-O-propargyl-dTTP) and the mixture of standard dNTPs (e.g., dATP, dCTP, dGTP, and dTTP) are provided together, e.g., the kit comprises a solution comprising the one or more 3′-O-propargyl nucleotide analog(s) (e.g., one or more of 3′-O-propargyl-dATP, 3′-O-propargyl-dCTP, 3′-O-propargyl-dGTP, and/or 3′-O-propargyl-dTTP) and the mixture of standard dNTPs (e.g., dATP, dCTP, dGTP, and dTTP). In some embodiments, the solution comprises the one or more 3′-O-propargyl nucleotide analog(s) (e.g., one or more of 3′-O-propargyl-dATP, 3′-O-propargyl-dCTP, 3′-O-propargyl-dGTP, and/or 3′-O-propargyl-dTTP) and the mixture of standard dNTPs (e.g., dATP, dCTP, dGTP, and dTTP) at a ratio of from 1:500 to 500:1 (e.g., 1:500, 1:450, 1:400, 1:350, 1:300, 1:250, 1:200, 1:150, 1:100, 1:90, 1:80, 1:70, 1:60, 1:50, 1:40, 1:30, 1:20, 1:10, 1:9, 1:8, 1:7, 1:6, 1:5, 1:4, 1:3, 1:2, 2:1, 3:1, 4:1, 5:1, 6:1, 7:1, 8:1, 9:1, 10:1, 20:1, 30:1, 40:1, 50:1, 60:1, 70:1, 80:1, 90:1, 100:1, 150:1, 200:1, 250:1, 300:1, 350:1, 400:1, 450:1, or 500:1).

Some embodiments of kits further comprise software for processing sequence data, e.g., to extract nucleotide sequence data from the data produced by a sequencer; to identify barcodes and target subsequences from the data produced by a sequencer; to align and/or assemble subsequences from the data produced by a sequencer to produce a consensus sequence; and/or to align subsequences and/or a consensus sequence to a reference sequence (e.g., to identify sequence differences (e.g., to identify alleles, homologs, phylogenetic relationships, chromosomes, sequence similarities or differences, mutations, and/or sequencing errors, etc.) and/or to correct sequence anomalies (e.g., sequencing errors)).
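The barcode identification and binning functions of such software might be sketched as follows. This Python example is illustrative only and rests on assumptions not stated in the description: each read is taken to be a plain string in which the barcode occupies the first `barcode_len` bases and the target subsequence follows immediately. The function name and read layout are hypothetical; an actual read structure would depend on the adaptor design and the sequencing platform.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def bin_reads_by_barcode(reads: Iterable[str],
                         barcode_len: int = 8) -> Dict[str, List[str]]:
    """Group target subsequences by the barcode found at the start of each read.

    Assumed (hypothetical) read layout: [barcode][target subsequence].
    Reads too short to contain a barcode plus at least one target base
    are skipped.
    """
    bins: Dict[str, List[str]] = defaultdict(list)
    for read in reads:
        if len(read) <= barcode_len:
            continue
        barcode, subsequence = read[:barcode_len], read[barcode_len:]
        bins[barcode].append(subsequence)
    return dict(bins)


reads = ["AAAACCCCGGTT", "AAAACCCCTTAA", "GGGGTTTTACGT", "AAAA"]
for barcode, subsequences in bin_reads_by_barcode(reads).items():
    print(barcode, subsequences)
```

Each resulting bin collects the subsequences that share a barcode, so that downstream assembly can be performed per source molecule or per sample.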

Some embodiments provide a system for sequencing a target nucleic acid, the system comprising an adaptor oligonucleotide comprising a first reactive group (e.g., an azide), a 3′-O-blocked nucleotide analog (e.g., a 3′-O-alkynyl nucleotide analog or a 3′-O-propargyl nucleotide analog, e.g., comprising an alkyne group and, e.g., comprising a second reactive group that forms a chemical bond with the first reactive group, e.g., using click chemistry, e.g., using a copper-based click chemistry catalyst), a sequencing apparatus, a nucleic acid fragment ladder (e.g., comprising a plurality of nucleic acids having 3′ ends that differ by less than 20 nucleotides, less than 10 nucleotides, less than 5 nucleotides, less than 4 nucleotides, less than 3 nucleotides, or by 1 nucleotide), and software for assembling short overlapping nucleotide sequences into a consensus sequence, wherein each short nucleotide sequence has less than 100, less than 90, less than 80, less than 70, less than 60, less than 50, less than 45, less than 40, less than 35, or less than 30 bases; the short nucleotide sequences are tiled over a target nucleic acid having at least 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 2500, 3000, 3500, 4000, 5000, or more than 5000 bases; and the short nucleotide sequences are offset from one another by 1-20, 1-10, or 1-5 bases.

Additional embodiments will be apparent to persons skilled in the relevant art based on the teachings contained herein.

BRIEF DESCRIPTION OF THE DRAWINGS

These and other features, aspects, and advantages of the present technology will become better understood with regard to the following drawings:

FIG. 1 is a schematic depicting an embodiment of the technology for sequencing a nucleic acid.

FIGS. 2A-2C are schematics depicting an embodiment of the technology for producing a library for next-generation sequencing. FIG. 2A shows one embodiment of the technology and FIG. 2B shows another embodiment of the technology. FIG. 2C shows another embodiment of the technology.

FIG. 3 is a schematic depicting an embodiment of the technology for sequencing a nucleic acid.

FIG. 4 is a schematic depicting an embodiment of the technology for sequencing a nucleic acid.

FIG. 5 shows flowcharts relating to embodiments of the technology that find use in sequencing a nucleic acid. FIG. 5A is a flowchart showing an embodiment of the technology comprising obtaining sequence data from a NGS library and extracting the overlapping subsequences of the target sequence. FIG. 5B is a flowchart showing an embodiment of the technology for extracting sequence data comprising concatenating sequence data files, identifying and extracting target sequence, and aligning the target sequences to provide a consensus sequence.

FIG. 6 shows predicted and experimental coverage of a target sequence by the short sequence reads produced by embodiments of the technology. FIG. 6A shows sequence alignment of 40-bp reads and the corresponding sequence coverage profile. The consensus and reference sequences are also shown (a 177-bp sequence comprising exon 2 of human KRAS and partial flanking intron sequences). FIG. 6B shows the predicted short read sequence alignment and corresponding sequence coverage profile for a theoretical template reference sequence.

FIG. 7 shows a schematic of an embodiment of the technology related to a “copy and tag” scheme using polymerase extension of a primer comprising a barcode sequence and universal sequences.

FIG. 8 shows a scheme for the experimental detection of “copy and tag” reaction products and for evaluating the effectiveness of the polymerase extension blocker.

FIG. 9 shows a scheme for an adaptor ligation based molecular barcoding strategy according to particular embodiments of the technology.

FIG. 10 shows a scheme for the experimental detection of adaptor ligated products.

FIG. 11 shows a scheme for intramolecular ligation (circularization) of single stranded DNA as a step in generating ladder fragments according to the technology provided herein.

FIG. 12 shows a scheme for the experimental detection of circular templates according to embodiments of the technology related to the generation of circular templates for fragment ladder generation.

It is to be understood that the figures are not necessarily drawn to scale, nor are the objects in the figures necessarily drawn to scale in relationship to one another. The figures are depictions that are intended to bring clarity and understanding to various embodiments of apparatuses, systems, and methods disclosed herein. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts. Moreover, it should be appreciated that the drawings are not intended to limit the scope of the present teachings in any way.

DETAILED DESCRIPTION

The technology generally relates to obtaining a nucleotide sequence, such as a consensus sequence or a haplotype sequence. In some embodiments provided herein is technology to produce a library of short overlapping DNA fragments from a larger target DNA fragment to be sequenced. The short overlapping DNA fragments have a range of lengths such that one fragment differs from another fragment by 1-5 bases, preferably 1 base, at their 3′ ends (e.g., a fragment ladder similar to that produced by conventional Sanger sequencing methods). In some embodiments, the short overlapping DNA fragments are indexed to generate a next generation sequencing (NGS) library. The library finds use in performing NGS by initiating sequencing reactions from the varying 3′ ends of the DNA fragments. Acquiring ˜30-base to ˜50-base sequence reads from the 3′ ends of the short overlapping fragments produces a tiled set of ˜30-base to ˜50-base sequence reads spanning the larger target DNA to be sequenced and offset from one another by 1-5 bases, preferably offset by 1 base. Assembling the overlapping ˜30-50 bp short sequence reads produces a long contiguous read covering a larger region (˜800-1000 bp) of the target DNA fragment. Thus, each sequence read results from the highest quality bases produced by NGS (e.g., the first 20-100 bases) and each base of the assembly is the consensus of 30-50 independent high quality sequence reads.
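The relationship between read length, offset, and per-base depth described above can be illustrated with a short calculation. The following Python sketch is a simplified model offered for illustration only: it assumes an idealized ladder in which one read of fixed length ends at every `offset`-th position of the target, and the function name and parameters are hypothetical.

```python
from typing import List


def tiled_coverage(target_len: int,
                   read_len: int = 40,
                   offset: int = 1) -> List[int]:
    """Per-base read depth for fixed-length reads tiled across a target.

    Idealized assumption: one read ends at each ladder position, and the
    ladder positions are spaced `offset` bases apart.
    """
    depth = [0] * target_len
    # `end` is the exclusive 3' boundary of a read of length `read_len`.
    for end in range(read_len, target_len + 1, offset):
        for pos in range(end - read_len, end):
            depth[pos] += 1
    return depth


depth = tiled_coverage(target_len=1000, read_len=40, offset=1)
print(max(depth))           # interior bases: one read per ladder position
print(depth[0], depth[-1])  # the outermost bases are covered only once
```

With a 1-base offset, each interior base is covered by as many reads as the read length, which is consistent with the statement that each base of the assembly is the consensus of 30-50 reads when 30-base to 50-base reads are acquired. Larger offsets reduce the depth proportionally.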

In the description of this technology, the section headings used herein are for organizational purposes only and are not to be construed as limiting the described subject matter in any way.

In this detailed description of the various embodiments, for purposes of explanation, numerous specific details are set forth to provide a thorough understanding of the embodiments disclosed. One skilled in the art will appreciate, however, that these various embodiments may be practiced with or without these specific details. In other instances, structures and devices are shown in block diagram form. Furthermore, one skilled in the art can readily appreciate that the specific sequences in which methods are presented and performed are illustrative and it is contemplated that the sequences can be varied and still remain within the spirit and scope of the various embodiments disclosed herein.

All literature and similar materials cited in this application, including but not limited to, patents, patent applications, articles, books, treatises, and internet web pages are expressly incorporated by reference in their entirety for any purpose. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as is commonly understood by one of ordinary skill in the art to which the various embodiments described herein belong. When definitions of terms in incorporated references appear to differ from the definitions provided in the present teachings, the definition provided in the present teachings shall control.

Definitions

To facilitate an understanding of the present technology, a number of terms and phrases are defined below. Additional definitions are set forth throughout the detailed description.

Throughout the specification and claims, the following terms take the meanings explicitly associated herein, unless the context clearly dictates otherwise. The phrase “in one embodiment” as used herein does not necessarily refer to the same embodiment, though it may. Furthermore, the phrase “in another embodiment” as used herein does not necessarily refer to a different embodiment, although it may. Thus, as described below, various embodiments of the invention may be readily combined, without departing from the scope or spirit of the invention.

In addition, as used herein, the term “or” is an inclusive “or” operator and is equivalent to the term “and/or” unless the context clearly dictates otherwise. The term “based on” is not exclusive and allows for being based on additional factors not described, unless the context clearly dictates otherwise. In addition, throughout the specification, the meaning of “a”, “an”, and “the” includes plural references. The meaning of “in” includes “in” and “on.”

As used herein, a “nucleotide” comprises a “base” (alternatively, a “nucleobase” or “nitrogenous base”), a “sugar” (in particular, a five-carbon sugar, e.g., ribose or 2-deoxyribose), and a “phosphate moiety” of one or more phosphate groups (e.g., a monophosphate, a diphosphate, or a triphosphate consisting of one, two, or three linked phosphates, respectively). Without the phosphate moiety, the nucleobase and the sugar compose a “nucleoside”. A nucleotide can thus also be called a nucleoside monophosphate or a nucleoside diphosphate or a nucleoside triphosphate, depending on the number of phosphate groups attached. The phosphate moiety is usually attached to the 5-carbon of the sugar, though some nucleotides comprise phosphate moieties attached to the 2-carbon or the 3-carbon of the sugar. Nucleotides contain either a purine (in the nucleotides adenine and guanine) or a pyrimidine base (in the nucleotides cytosine, thymine, and uracil). Ribonucleotides are nucleotides in which the sugar is ribose. Deoxyribonucleotides are nucleotides in which the sugar is deoxyribose.

As used herein, a “nucleic acid” shall mean any nucleic acid molecule, including, without limitation, DNA, RNA, and hybrids thereof. The nucleic acid bases that form nucleic acid molecules can be the bases A, C, G, T and U, as well as derivatives thereof. Derivatives of these bases are well known in the art. The term should be understood to include, as equivalents, analogs of either DNA or RNA made from nucleotide analogs. The term as used herein also encompasses cDNA, that is complementary, or copy, DNA produced from an RNA template, for example by the action of a reverse transcriptase.

As used herein, “nucleic acid sequencing data”, “nucleic acid sequencing information”, “nucleic acid sequence”, “genomic sequence”, “genetic sequence”, “fragment sequence”, or “nucleic acid sequencing read” denotes any information or data that is indicative of the order of the nucleotide bases (e.g., adenine, guanine, cytosine, and thymine/uracil) in a molecule (e.g., a whole genome, a whole transcriptome, an exome, oligonucleotide, polynucleotide, fragment, etc.) of DNA or RNA.

It should be understood that the present teachings contemplate sequence information obtained using all available varieties of techniques, platforms or technologies, including, but not limited to: capillary electrophoresis, microarrays, ligation-based systems, polymerase-based systems, hybridization-based systems, direct or indirect nucleotide identification systems, pyrosequencing, ion- or pH-based detection systems, electronic signature-based systems, etc.

Reference to a base, a nucleotide, or to another molecule may be in the singular or plural. That is, “a base” may refer to a single molecule of that base or to a plurality of the base, e.g., in a solution.

A “polynucleotide”, “nucleic acid”, or “oligonucleotide” refers to a linear polymer of nucleosides (including deoxyribonucleosides, ribonucleosides, or analogs thereof) joined by internucleosidic linkages. Typically, a polynucleotide comprises at least three nucleosides. Usually oligonucleotides range in size from a few monomeric units, e.g., 3-4, to several hundreds of monomeric units. Whenever a polynucleotide such as an oligonucleotide is represented by a sequence of letters, such as “ATGCCTG,” it will be understood that the nucleotides are in 5′ to 3′ order from left to right and that “A” denotes deoxyadenosine, “C” denotes deoxycytidine, “G” denotes deoxyguanosine, and “T” denotes thymidine, unless otherwise noted. The letters A, C, G, and T may be used to refer to the bases themselves, to nucleosides, or to nucleotides comprising the bases, as is standard in the art.

As used herein, the term “target nucleic acid” or “target nucleotide sequence” refers to any nucleotide sequence (e.g., RNA or DNA), the manipulation of which may be deemed desirable for any reason by one of ordinary skill in the art. In some contexts, “target nucleic acid” refers to a nucleic acid whose nucleotide sequence is to be determined or is desired to be determined. In some contexts, the term “target nucleotide sequence” refers to a sequence to which a partially or completely complementary primer or probe is generated.

As used herein, the term “region of interest” refers to a nucleic acid that is analyzed (e.g., using one of the compositions, systems, or methods described herein). In some embodiments, the region of interest is a portion of a genome or region of genomic DNA (e.g., comprising one or more chromosomes or one or more genes). In some embodiments, mRNA expressed from a region of interest is analyzed.

As used herein, the term “corresponds to” or “corresponding” is used in reference to a contiguous nucleic acid or nucleotide sequence (e.g., a subsequence) that is complementary to, and thus “corresponds to”, all or a portion of a target nucleic acid sequence.

As used herein, the phrase “a clonal plurality of nucleic acids” refers to the nucleic acid products that are complete or partial copies of a template nucleic acid from which they were generated. These products are substantially or completely or essentially identical to each other, and they are complementary copies of the template nucleic acid strand from which they are synthesized, assuming that the rate of nucleotide misincorporation during the synthesis of the clonal nucleic acid molecules is 0%.

As used herein, the term “library” refers to a plurality of nucleic acids, e.g., a plurality of different nucleic acids.

As used herein, a “subsequence” of a nucleotide sequence refers to any nucleotide sequence contained within the nucleotide sequence, including any subsequence having a size of a single base up to a subsequence that is one base shorter than the nucleotide sequence.

As used herein, the term “consensus sequence” refers to a sequence that is common to, or otherwise present in the largest fraction of, an aligned group of sequences. The consensus sequence shows the nucleotide most commonly found at each position within the nucleic acid sequences of the group of sequences. A consensus sequence is often “assembled” from shorter sequence reads.
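As an illustration of this definition, a consensus can be computed by taking the most common base in each column of a set of already aligned, equal-length sequences. The following Python sketch is a minimal example under those assumptions; the function name `consensus` and the use of `-` as a gap character are hypothetical choices made for illustration and are not part of the technology described herein.

```python
from collections import Counter


def consensus(aligned_reads):
    """Return the most common base in each column of aligned reads.

    Assumes every read has the same length and that '-' marks a gap.
    Ties are resolved in favor of the base seen first in the column.
    """
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("aligned reads must all have the same length")

    result = []
    for column in zip(*aligned_reads):
        # Count only real bases; gaps do not vote.
        counts = Counter(base for base in column if base != "-")
        result.append(counts.most_common(1)[0][0] if counts else "-")
    return "".join(result)


if __name__ == "__main__":
    print(consensus(["ATGC", "ATGC", "ATCC"]))
```

A production aligner would also weigh base qualities; this sketch only shows the majority vote itself.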

As used herein, “assembly” refers to generating nucleotide sequence information from shorter sequences, e.g., experimentally acquired sequence reads. Sequence assembly can generally be divided into two broad categories: de novo assembly and reference genome mapping assembly. In de novo assembly, sequence reads are assembled together so that they form a new and previously unknown sequence. In reference genome “mapping”, sequence reads are assembled against an existing “reference sequence” to build a sequence that is similar to but not necessarily identical to the reference sequence.
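The idea of de novo assembly from overlapping reads can be sketched with a simple greedy extension: a contig is grown to the right whenever the start of a read exactly matches the end of the contig. This is a toy Python example, not a description of any particular assembler; the names `extend_right` and `assemble`, the `min_overlap` parameter, the requirement of error-free reads, and the assumption that the first read is the leftmost one are all illustrative assumptions.

```python
def extend_right(contig, read, min_overlap):
    """Merge read onto the right end of contig using the longest exact overlap.

    Returns the extended contig, or None if no overlap of at least
    min_overlap bases exists.
    """
    max_k = min(len(contig), len(read))
    for k in range(max_k, min_overlap - 1, -1):
        if contig.endswith(read[:k]):
            return contig + read[k:]
    return None


def assemble(reads, min_overlap=3):
    """Greedily extend the first read with the remaining reads.

    Assumes error-free reads and that reads[0] is the leftmost read.
    """
    remaining = list(reads)
    contig = remaining.pop(0)
    progress = True
    while remaining and progress:
        progress = False
        for read in list(remaining):
            merged = extend_right(contig, read, min_overlap)
            if merged is not None:
                contig = merged
                remaining.remove(read)
                progress = True
    return contig


if __name__ == "__main__":
    print(assemble(["ATGCCT", "CTGAAC", "GCCTGA"]))
```

Reads that never overlap the growing contig are simply left unused in this sketch.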

The phrase “sequencing run” refers to any step or portion of a sequencing experiment performed to determine some information relating to at least one biomolecule (e.g., nucleic acid molecule).

As used herein, the term “dNTP” means deoxynucleotide triphosphate, where the nucleotide comprises a nucleotide base, such as A, T, C, G, or U.

The term “monomer” as used herein means any compound that can be incorporated into a growing molecular chain by a given polymerase. Such monomers include, without limitation, naturally occurring nucleotides (e.g., ATP, GTP, TTP, UTP, CTP, dATP, dGTP, dTTP, dUTP, dCTP, synthetic analogs), precursors for each nucleotide, non-naturally occurring nucleotides and their precursors, or any other molecule that can be incorporated into a growing polymer chain by a given polymerase.

As used herein, “complementary” generally refers to specific nucleotide duplexing to form canonical Watson-Crick base pairs, as is understood by those skilled in the art. However, complementary also includes base-pairing of nucleotide analogs that are capable of universal base-pairing with A, T, G or C nucleotides and locked nucleic acids that enhance the thermal stability of duplexes. One skilled in the art will recognize that hybridization stringency is a determinant in the degree of match or mismatch in the duplex formed by hybridization.
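For the canonical Watson-Crick case only, complementarity between two sequences written 5′ to 3′ can be checked by comparing one sequence with the reverse complement of the other. The Python sketch below is a minimal illustration; the function names are hypothetical, and nucleotide analogs, universal bases, and mismatches tolerated at lower stringency are deliberately not modeled.

```python
# Translation table for canonical DNA base pairs (A-T, C-G).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_fully_complementary(seq_a, seq_b):
    """True if seq_b pairs with every base of seq_a in antiparallel orientation."""
    return reverse_complement(seq_a) == seq_b


if __name__ == "__main__":
    print(reverse_complement("ATGCCTG"))
```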

A “polymerase” is an enzyme generally for joining 3′-OH 5′-triphosphate nucleotides, oligomers, and their analogs. Polymerases include, but are not limited to, DNA-dependent DNA polymerases, DNA-dependent RNA polymerases, RNA-dependent DNA polymerases, RNA-dependent RNA polymerases, T7 DNA polymerase, T3 DNA polymerase, T4 DNA polymerase, T7 RNA polymerase, T3 RNA polymerase, SP6 RNA polymerase, DNA polymerase I, Klenow fragment, Thermus aquaticus (Taq) DNA polymerase, Thermus thermophilus (Tth) DNA polymerase, Vent DNA polymerase (New England Biolabs), Deep Vent DNA polymerase (New England Biolabs), Bacillus stearothermophilus (Bst) DNA polymerase, DNA Polymerase Large Fragment, Stoffel Fragment, 9° N DNA Polymerase, 9° Nm polymerase, Pyrococcus furiosus (Pfu) DNA Polymerase, Thermus filiformis (Tfl) DNA Polymerase, RepliPHI Phi29 Polymerase, Thermococcus litoralis (Tli) DNA polymerase, eukaryotic DNA polymerase beta, telomerase, Therminator (e.g., THERMINATOR I, THERMINATOR II, etc.) polymerase (New England Biolabs), KOD HiFi DNA polymerase (Novagen), KOD1 DNA polymerase, Q-beta replicase, terminal transferase, AMV reverse transcriptase, M-MLV reverse transcriptase, Phi6 reverse transcriptase, HIV-1 reverse transcriptase, novel polymerases discovered by bioprospecting and/or molecular evolution, and polymerases cited in U.S. Pat. Appl. Pub. No. 2007/0048748 and in U.S. Pat. Nos. 6,329,178; 6,602,695; and 6,395,524. These polymerases include wild-type, mutant isoforms, and genetically engineered variants such as exo-polymerases; polymerases with minimized, undetectable, and/or decreased 3′→5′ proofreading exonuclease activity, and other mutants, e.g., that tolerate labeled nucleotides and incorporate them into a strand of nucleic acid. In some embodiments, the polymerase is designed for use, e.g., in real-time PCR, high fidelity PCR, next-generation DNA sequencing, fast PCR, hot start PCR, crude sample PCR, robust PCR, and/or molecular diagnostics. Such enzymes are available from many commercial suppliers, e.g., Kapa Enzymes, Finnzymes, Promega, Invitrogen, Life Technologies, Thermo Scientific, Qiagen, Roche, etc.

The term “primer” refers to an oligonucleotide, whether occurring naturally as in a purified restriction digest or produced synthetically, that is capable of acting as a point of initiation of synthesis when placed under conditions in which synthesis of a primer extension product that is complementary to a nucleic acid strand is induced (e.g., in the presence of nucleotides and an inducing agent such as DNA polymerase and at a suitable temperature and pH). The primer is preferably single stranded for maximum efficiency in amplification, but may alternatively be double stranded. If double stranded, the primer is first treated to separate its strands before being used to prepare extension products. Preferably, the primer is an oligodeoxyribonucleotide. The primer must be sufficiently long to prime the synthesis of extension products in the presence of the inducing agent. The exact lengths of the primers will depend on many factors, including temperature, source of primer and the use of the method.

As used herein, an “adaptor” is an oligonucleotide that is linked or is designed to be linked to a nucleic acid to introduce the nucleic acid into a sequencing workflow. An adaptor may be single-stranded or double-stranded (e.g., a double-stranded DNA or a single-stranded DNA). As used herein, the term “adaptor” refers to the adaptor nucleic acid both in a state that is not linked to another nucleic acid and in a state that is linked to a nucleic acid.

At least a portion of the adaptor comprises a known sequence. For example, some embodiments of adaptors comprise a primer binding sequence for amplification of the nucleic acid and/or for binding of a sequencing primer. Some adaptors comprise a sequence for hybridization of a complementary capture probe. Some adaptors comprise a chemical or other moiety (e.g., a biotin moiety) for capture and/or immobilization to a solid support (e.g., comprising an avidin moiety). Some embodiments of adaptors comprise a marker, index, barcode, tag, or other sequence by which the adaptor and a nucleic acid to which it is linked are identifiable.

Some adaptors comprise a universal sequence. A universal sequence is a sequence shared by a plurality of adaptors that may otherwise have different sequences outside of the universal sequence. For example, a universal sequence provides a common primer binding site for a collection of nucleic acids from different target nucleic acids, e.g., that may comprise different barcodes.

Some embodiments of adaptors comprise a defined but unknown sequence. For example, some embodiments of adaptors comprise a degenerate sequence of a defined number of bases (e.g., a 1- to 20-base degenerate sequence). Such a sequence is defined even if each individual sequence is not known; such a sequence may nevertheless serve as an index, barcode, tag, etc. marking nucleic acid fragments from, e.g., the same target nucleic acid.

Some adaptors comprise a blunt end and some adaptors comprise an end with an overhang of one or more bases.

In particular embodiments provided herein, an adaptor comprises an azido moiety, e.g., the adaptor comprises an azido (e.g., an azido-methyl) moiety on its 5′ end. Thus, some embodiments are related to adaptors that are or that comprise a 5′-azido-modified oligonucleotide or a 5′-azido-methyl-modified oligonucleotide.

As used herein, a “system” denotes a set of components, real or abstract, comprising a whole where each component interacts with or is related to at least one other component within the whole.

As used herein, “index” shall generally mean a distinctive or identifying mark or characteristic. One example of an index is a short nucleotide sequence used as a “barcode” to identify a longer nucleic acid comprising the barcode and other sequence.

As used herein, the term “phase” or “phasing” refers to the unique content of the two chromosomes inherited from each parent and/or separating maternally and paternally derived sequence information present on a nucleic acid (e.g., a chromosome). For example, haplotype phasing information describes which nucleotides (e.g., a SNP), regions, portions, or fragments originated from each of the parental chromosomes (or are associated with a specific minor viral quasi-species).

As used herein, a “Sanger ladder”, “DNA ladder”, “fragment ladder”, or “ladder” refers to a library of nucleic acids (e.g., DNA) that each differ in length by a small number of bases, e.g., one to five bases and in some preferred embodiments by one base. In some embodiments, the nucleic acids in the ladder have 5′ ends that correspond to the same nucleotide position (or fall within a small range of nucleotide positions, e.g., 1-10 nucleotide positions) in the template from which they were made and have different 3′ ends that correspond to a range of nucleotide positions in the template from which they were made. See, e.g., exemplary ladders and/or ladders similar to those provided herein in Sanger & Coulson (1975) “A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase” J Mol Biol 94(3): 441-8; Sanger et al. (1977) “DNA sequencing with chain-terminating inhibitors” Proc Natl Acad Sci USA 74(12): 5463-7.

DESCRIPTION

In some embodiments, the technology provided herein provides methods and compositions to create short overlapping DNA fragments that span a larger DNA fragment. In particular, the short DNA fragments compose a population of DNA fragments having a range of sizes that increase in size from one fragment to the next larger fragment by, for example, 1 to 20 base pairs, 1 to 10 base pairs, or 1 to 5 base pairs, preferably by 1 base pair (e.g., as in the case of fragments generated by Sanger sequencing). In some embodiments, a short nucleic acid having a universal sequence is appended to the 3′ end of each fragment (e.g., the end of the fragment where the ladder is generated). Subsequently, the fragments are sequenced using a sequencing primer complementary to the universal sequence. As such, the sequences generated have a range of 5′ (first) bases corresponding to bases distributed along the length of the larger DNA from the first base attached to the universal sequence up to 500 bases or more. Preferably, the sequences generated have a range of 5′ (first) bases corresponding to each base distributed along the length of the larger DNA. With this method, short NGS reads (~30 to ~50 bases) are used to assemble a long contiguous read that retains phase and/or linkage information (see, e.g., FIG. 1).
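The tiling of short reads along a longer template can be pictured with a toy simulation. The Python sketch below assumes an idealized ladder in which fragments share a 5′ end and terminate at every `step`-th base, and in which each read reports the `read_len` bases next to the 3′ end of its fragment. For simplicity the reads are written in template orientation, and strand orientation, adaptors, and sequencing errors are ignored; the function name and parameters are hypothetical.

```python
def ladder_reads(template, read_len, step=1):
    """Return the reads obtained from an idealized fragment ladder.

    Fragment i ends at position `end` of the template; its read covers
    the read_len bases immediately before that 3' end.
    """
    if read_len <= 0 or step <= 0:
        raise ValueError("read_len and step must be positive")
    reads = []
    for end in range(read_len, len(template) + 1, step):
        reads.append(template[end - read_len:end])
    return reads


if __name__ == "__main__":
    for read in ladder_reads("ATGCCTGAAC", read_len=4):
        print(read)
```

With a step of one base, consecutive reads overlap by `read_len - 1` bases, which is what makes assembly of a longer contiguous sequence from short reads possible in this idealized picture.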

1. Methods of Producing Libraries for NGS

Embodiments of the technology are depicted by the schematic shown in FIG. 2. First, in some embodiments, a target nucleic acid is amplified using one or more target specific primers (see, e.g., FIG. 2A, step i; FIG. 2C, step i). The target nucleic acid may be a DNA or an RNA, e.g., a genomic DNA; mRNA; a cosmid, fosmid, or bacterial artificial chromosome (e.g., comprising an insert), a gene, a plasmid, etc. In some embodiments, an RNA is first reverse transcribed to produce a DNA. Amplification may be PCR, limited cycle (low cycle, e.g., 5-15 cycles (e.g., 8 cycles)) PCR, isothermal PCR, amplification with Phi29 or Bst enzymes, etc., e.g., as shown in FIG. 2A and in FIG. 2C.

In some embodiments, the target specific primers include both a universal sequence (e.g., universal sequence A) and a uniquely identifying index sequence (e.g., a barcode sequence; see FIG. 2A, “NNNNN” barcode sequence) that allows tracking and/or identifying the target nucleic acid from which the amplified product (amplicon) was produced. Generally, barcode sequences may consist of 1 to 10 or more nucleotides. For example, a 10-base barcode sequence provides 1,048,576 (4¹⁰) combinations of uniquely identifiable target-specific primer molecules. Consequently, with an appropriately designed barcode length, a starting material containing a small to a very large number of target DNA fragments can be reliably tagged and indexed without duplicate tagging with the same barcode sequence.
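The barcode arithmetic above can be checked with a few lines of Python. The sketch below computes the number of distinct barcodes of a given length and, as an added illustration, estimates the chance that a set of templates all receive different barcodes. The second function assumes that barcodes are drawn uniformly and independently at random, which is an assumption made here for illustration and is not stated in the text; both function names are hypothetical.

```python
import math


def barcode_space(length):
    """Number of distinct barcodes of the given length over A, C, G, T."""
    return 4 ** length


def prob_all_unique(n_templates, length):
    """Probability that n_templates random barcodes are all different.

    Assumes each barcode is drawn uniformly and independently
    (the classic birthday-problem calculation).
    """
    space = barcode_space(length)
    if n_templates > space:
        return 0.0
    # Sum logs to stay numerically stable for large n_templates.
    log_p = sum(math.log1p(-i / space) for i in range(n_templates))
    return math.exp(log_p)


if __name__ == "__main__":
    print(barcode_space(10))
    print(prob_all_unique(100, 10))
```

Under this assumption, the probability of duplicate tagging grows with the number of templates, so a longer barcode is appropriate when many templates are tagged at once.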

In some embodiments, the primers are used for amplification (e.g., do not comprise a barcode) and the target amplicon is ligated to an adaptor that comprises one or more universal sequences and/or one or more barcode sequences (see, e.g., FIG. 2C, “NNNNNNNNNN” barcode sequence). Thus, in some embodiments a next step comprises ligating an adaptor to the target amplicon. In some embodiments, the adaptor comprises a first strand comprising a stretch of degenerate sequence (e.g., comprising 8 to 12 bases) flanked on both the 5′ end and the 3′ end by two different universal sequences (e.g., universal sequence A and universal sequence B; see FIG. 9) and a second strand comprising a universal sequence C (e.g., at the 5′ end) and a sequence (e.g., at the 3′ end) that is complementary to universal sequence B and that has an additional T at the 3′-terminal position.

Embodiments are provided herein for producing a fragment ladder from a circularized template (see, e.g., FIG. 2A and FIG. 2B) and embodiments are provided herein for producing a fragment ladder from a linear template (see, e.g., FIG. 2C). Accordingly, in some embodiments, a next step comprises ligating the uniquely barcoded individual amplicons at their 3′ ends to an adaptor oligonucleotide approximately 10 to 80 bases in length and comprising a second universal sequence (e.g., universal sequence B) (see, e.g., FIG. 2A). After ligation, the adaptor-amplicon nucleic acids are self-ligated (e.g., circularized) to form a circular template (see, e.g., FIG. 2A, step ii). The circularization brings the universal sequence at the 3′ end adjacent to the barcode sequence at the 5′ end. Intramolecular ligation may be effected using a ligase. For example, CircLigase II (Epicentre) is a thermostable single-stranded DNA ligase that catalyzes intramolecular ligation of single-stranded DNA templates having a 5′ phosphate and a 3′ hydroxyl group.

Then, in embodiments related to using a circularized template, a Sanger fragment-like DNA ladder is generated by a polymerase reaction using a primer complementary to universal sequence A and a mix of dNTPs and 3′-O-blocked dNTP analogs as described herein (see, e.g., FIG. 2A, step iv). In some embodiments, the 3′-O-blocked dNTP analog is a 3′-O-alkynyl nucleotide analog (e.g., an alkyl, having a saturated position (sp³-hybridized) on a molecular framework next to an alkynyl group, and substituted variants thereof). In some embodiments, the 3′-O-blocked dNTP analog is a 3′-O-propargyl nucleotide analog having a structure as shown below:

where B is the base of the nucleotide (e.g., adenine, guanine, thymine, cytosine, or a natural or synthetic nucleobase, e.g., a modified purine such as hypoxanthine, xanthine, 7-methylguanine; a modified pyrimidine such as 5,6-dihydrouracil, 5-methylcytosine, 5-hydroxymethylcytosine; etc.) and P comprises a phosphate moiety. In some embodiments, P comprises a tetraphosphate; a triphosphate; a diphosphate; a monophosphate; a 5′ hydroxyl; an alpha thiophosphate (e.g., phosphorothioate or phosphorodithioate), a beta thiophosphate (e.g., phosphorothioate or phosphorodithioate), and/or a gamma thiophosphate (e.g., phosphorothioate or phosphorodithioate); or an alpha methylphosphonate, a beta methylphosphonate, and/or a gamma methylphosphonate. Other alkynyl groups are contemplated by the technology and find use in the technology, e.g., butynyl, etc. In some embodiments, the nucleotide analog is as described in other sections herein.

Alternatively, in embodiments related to the use of a linear template (see, e.g., FIG. 2C), a Sanger fragment-like DNA ladder is generated by a polymerase reaction using a primer complementary to a sequence in the adaptor and a mix of dNTPs and 3′-O-blocked dNTP analogs as described herein (see, e.g., FIG. 2C, step iii). In some embodiments, the 3′-O-blocked dNTP analog is a 3′-O-alkynyl nucleotide analog (e.g., an alkyl having a saturated position (sp³-hybridized) on a molecular framework next to an alkynyl group, and substituted variants thereof). In some embodiments, the 3′-O-blocked dNTP analog is a 3′-O-propargyl nucleotide analog having a structure as shown below:

where B is the base of the nucleotide (e.g., adenine, guanine, thymine, cytosine, or a natural or synthetic nucleobase, e.g., a modified purine such as hypoxanthine, xanthine, 7-methylguanine; a modified pyrimidine such as 5,6-dihydrouracil, 5-methylcytosine, 5-hydroxymethylcytosine; etc.) and P comprises a phosphate moiety. In some embodiments, P comprises a tetraphosphate; a triphosphate; a diphosphate; a monophosphate; a 5′ hydroxyl; an alpha thiophosphate (e.g., phosphorothioate or phosphorodithioate), a beta thiophosphate (e.g., phosphorothioate or phosphorodithioate), and/or a gamma thiophosphate (e.g., phosphorothioate or phosphorodithioate); or an alpha methylphosphonate, a beta methylphosphonate, and/or a gamma methylphosphonate. Other alkynyl groups are contemplated by the technology and find use in the technology, e.g., butynyl, etc. In some embodiments, the nucleotide analog is as described in other sections herein.

Embodiments of the technology provide advantages over existing technologies. For example, in some embodiments the technology provides high quality sequence from a small amount of input nucleic acid (e.g., less than 10 ng of nucleic acid, e.g., less than 10 ng of genomic DNA). The technology provides for the robust tagging of individual templates. Production of libraries is efficient because the methods comprise few manipulations (and thus few clean-up steps) and each of the manipulations has a sufficient yield.

In some embodiments, the nucleotide analog comprises a reversible terminator that comprises a blocking group that can be removed to unblock the nucleotide. In some embodiments, the nucleotide analog comprises a functional terminator, e.g., that provides a particular desired reactivity for subsequent steps.

The nucleotide analogs result in the production of a fragment ladder having fragments over a range of sizes. For example, in some embodiments, the fragments have lengths ranging from approximately 10 to approximately 50 bp, approximately 10 to approximately 100 bp, and up to approximately 100 bp to approximately 700 or approximately 800 bp or more; furthermore, in some embodiments lengths greater than 1000 bp are achieved by adjusting the ratio of dNTPs and 3′-O-blocked dNTP analogs in the reaction mixture (e.g., using a ratio of from 1:500 to 500:1 (e.g., 1:500, 1:450, 1:400, 1:350, 1:300, 1:250, 1:200, 1:150, 1:100, 1:90, 1:80, 1:70, 1:60, 1:50, 1:40, 1:30, 1:20, 1:10, 1:9, 1:8, 1:7, 1:6, 1:5, 1:4, 1:3, 1:2, 2:1, 3:1, 4:1, 5:1, 6:1, 7:1, 8:1, 9:1, 10:1, 20:1, 30:1, 40:1, 50:1, 60:1, 70:1, 80:1, 90:1, 100:1, 150:1, 200:1, 250:1, 300:1, 350:1, 400:1, 450:1, or 500:1)).
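How the ratio of dNTPs to terminating analogs influences fragment length can be illustrated with a simple probabilistic model. The Python sketch below assumes that, at every position, the polymerase incorporates a terminator with probability equal to its share of the nucleotide pool, so that extension lengths follow a geometric distribution. Real polymerases generally do not incorporate analogs and natural dNTPs with equal efficiency, so this is an idealization for intuition only; the function names and the reading of the ratio as dNTP parts to terminator parts are assumptions.

```python
def termination_probability(dntp_parts, terminator_parts):
    """Per-base chance of incorporating a terminator (equal-efficiency model)."""
    return terminator_parts / (dntp_parts + terminator_parts)


def expected_extension_length(dntp_parts, terminator_parts):
    """Mean number of bases added, including the terminating base."""
    return (dntp_parts + terminator_parts) / terminator_parts


def fraction_at_least(length, dntp_parts, terminator_parts):
    """Fraction of extensions reaching at least `length` bases."""
    p = termination_probability(dntp_parts, terminator_parts)
    return (1.0 - p) ** (length - 1)


if __name__ == "__main__":
    for ratio in (10, 100, 500):
        print(ratio, expected_extension_length(ratio, 1))
```

In this model, raising the proportion of dNTPs lengthens the average fragment, which is consistent with the statement that longer ladders are obtained by adjusting the ratio.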

Conventional dideoxynucleotide (ddNTP) sequencing technologies (e.g., Sanger-type sequencing chemistries) are not appropriate for this step in these embodiments because the lack of a 3′-OH group in the terminating ddNTP creates a non-reactive terminal 3′ end that cannot accept the ligation of the second adaptor oligonucleotide in the subsequent step.

Once the nucleic acid fragment ladder is generated with reactive (e.g., ligatable) 3′ ends, a second adaptor oligonucleotide comprising a universal sequence (e.g., universal sequence C) is ligated (enzymatically or chemically) to the 3′ ends of the fragments of the nucleic acid fragment ladder to produce an NGS library (see, e.g., FIG. 2A, step v; FIG. 2C, step iv). In some embodiments, limited cycle PCR or another amplification method is performed to amplify the final product.

In some embodiments, the methods find use in acquiring short sequences, e.g., of ~120-200 bp. Such embodiments find use, e.g., in assessing cancer genes, e.g., to assess mutations of a cancer panel. In some embodiments, the technology finds use in acquiring sequences of 500 bp, 1000 bp, or more. For example, in some embodiments, a target nucleic acid is amplified using one or more target specific primers (see, e.g., FIG. 2B, step i; FIG. 2C, step i). The target nucleic acid may be a DNA or an RNA, e.g., a genomic DNA; mRNA; a cosmid, fosmid, or bacterial artificial chromosome (e.g., comprising an insert), a gene, a plasmid, etc. In some embodiments, an RNA is first reverse transcribed to produce a DNA. Amplification may be PCR, limited cycle PCR, isothermal PCR, amplification with Phi29 or Bst enzymes, etc., e.g., as shown in FIG. 2B and in FIG. 2C.

In some embodiments, the target specific primers include both a universal sequence (e.g., universal sequence A) and a uniquely identifying index sequence (e.g., a barcode sequence; see FIG. 2B, “NNNNN” barcode sequence) that allows tracking and/or identifying the target nucleic acid from which the amplified product (amplicon) was produced. Generally, barcode sequences may consist of 1 to 10 or more nucleotides. For example, a 10-base barcode sequence provides 1,048,576 (4¹⁰) combinations of uniquely identifiable target-specific primer molecules. Consequently, with an appropriately designed barcode length, a starting material containing a small to a very large number of target DNA fragments can be reliably tagged and indexed without duplicate tagging with the same barcode sequence.

In some embodiments, a next step comprises ligating the uniquely barcoded individual amplicons at their 3′ ends to an adaptor oligonucleotide approximately 10 to 80 bases in length and comprising a second universal sequence (e.g., universal sequence B) (see, e.g., FIG. 2B). After ligation, the adaptor-amplicon nucleic acids are self-ligated (e.g., circularized) to form a circular template (see, e.g., FIG. 2B, step iii). The circularization brings the universal sequence at the 3′ end adjacent to the barcode sequence at the 5′ end. Intramolecular ligation may be effected using a ligase. For example, CircLigase II (Epicentre) is a thermostable single-stranded DNA ligase that catalyzes intramolecular ligation of single-stranded DNA templates having a 5′ phosphate and a 3′ hydroxyl group.

Using the circularized template, a Sanger fragment-like DNA ladder is generated by a polymerase reaction using a primer complementary to universal sequence A and a mix of dNTPs and 3′-O-blocked dNTP analogs as described herein (see, e.g., FIG. 2B, step iv). In some embodiments, the 3′-O-blocked dNTP analog is a 3′-O-alkynyl nucleotide analog (e.g., an alkyl, having a saturated position (sp³-hybridized) on a molecular framework next to an alkynyl group, and substituted variants thereof). In some embodiments, the 3′-O-blocked dNTP analog is a 3′-O-propargyl nucleotide analog having a structure as shown below:

where B is the base of the nucleotide (e.g., adenine, guanine, thymine, cytosine, or a natural or synthetic nucleobase, e.g., a modified purine such as hypoxanthine, xanthine, 7-methylguanine; a modified pyrimidine such as 5,6-dihydrouracil, 5-methylcytosine, 5-hydroxymethylcytosine; etc.) and P comprises a phosphate moiety. In some embodiments, P comprises a tetraphosphate; a triphosphate; a diphosphate; a monophosphate; a 5′ hydroxyl; an alpha thiophosphate (e.g., phosphorothioate or phosphorodithioate), a beta thiophosphate (e.g., phosphorothioate or phosphorodithioate), and/or a gamma thiophosphate (e.g., phosphorothioate or phosphorodithioate); or an alpha methylphosphonate, a beta methylphosphonate, and/or a gamma methylphosphonate. Other alkynyl groups are contemplated by the technology and find use in the technology, e.g., butynyl, etc. In some embodiments, the nucleotide analog is as described in other sections herein.

In some embodiments, the nucleotide analog comprises a reversible terminator that comprises a blocking group that can be removed to unblock the nucleotide. In some embodiments, the nucleotide analog comprises a functional terminator, e.g., that provides a particular desired reactivity for subsequent steps. The nucleotide analogs result in the production of a fragment ladder having fragments over a range of sizes. For example, in some embodiments, the fragments have lengths ranging from ~400 bp to ~700 or 800 bp; furthermore, in some embodiments, sequence lengths greater than 1000 bp to greater than 10,000 bp are achieved, e.g., by adjusting the ratio of dNTPs and 3′-O-blocked dNTP analogs in the reaction mixture.

Conventional dideoxynucleotide (ddNTP) sequencing technologies (e.g., Sanger-type sequencing chemistries) are not appropriate for this step in these embodiments because the lack of a 3′-OH group in the terminating ddNTP creates a non-reactive terminal 3′ end that cannot accept the ligation of the second adaptor oligonucleotide in the subsequent step.

Then, the nucleic acid fragment ladder is circularized to form a nucleic acid circle library (see, e.g., FIG. 2B, step v). After a digestion with one or more restriction enzymes (see, e.g., FIG. 2B, step vi), a second adaptor oligonucleotide (e.g., comprising a universal sequence, e.g., universal sequence C) is ligated (enzymatically or chemically) to the 3′ ends of the digestion products of the nucleic acid circle library to produce an NGS library (see, e.g., FIG. 2B, step vii). In some embodiments, limited cycle PCR or another amplification method is performed to amplify the final product. Without being limited to any particular method or length of time to perform any steps of the methods provided, in some embodiments the methods described take from ~6 (e.g., ~6.5) hours to ~9 (e.g., ~8.5) hours to complete.

In some embodiments (e.g., embodiments using 3′-O-alkynyl nucleotide analog terminators such as 3′-O-propargyl nucleotide analogs), the fragments comprise a 3′ alkyne. Then, in some embodiments, the second adaptor oligonucleotide comprising a universal sequence (e.g., universal sequence C) comprises a 5′ azide (N₃) group that is reactable with the fragment 3′ alkyne group. Then, in some embodiments, a “click chemistry” process such as an azide-alkyne cycloaddition is used to link the adaptor to the fragment via formation of a triazole:

where R₁ and R₂ are individually any chemical structure or chemical moiety.

In some embodiments, the triazole ring linkage has a structure according to:

where R₁ and R₂ are individually any chemical structure or chemical moiety (and not necessarily the same from structure to structure) and B, B₁, and B₂ individually indicate the base of the nucleotide (e.g., adenine, guanine, thymine, cytosine, or a natural or synthetic nucleobase, e.g., a modified purine such as hypoxanthine, xanthine, 7-methylguanine; a modified pyrimidine such as 5,6-dihydrouracil, 5-methylcytosine, 5-hydroxymethylcytosine; etc.).

The triazole ring linkage formed by the alkyne-azide cycloaddition has characteristics (e.g., physical, biological, chemical characteristics) similar to those of a natural phosphodiester bond present in nucleic acids and therefore is a nucleic acid backbone mimic. Consequently, conventional enzymes that recognize natural nucleic acids as substrates also recognize as substrates the products formed by alkyne-azide cycloaddition as provided by the technology described herein. See, e.g., El-Sagheer, et al. (2011) “Biocompatible artificial DNA linker that is read through by DNA polymerases and is functional in Escherichia coli” Proc Natl Acad Sci USA 108(28): 11338-43, which is incorporated herein by reference in its entirety.

The final NGS fragment library is then used as the input to an NGS system for sequencing. During sequencing, ~20 to 50 bases of DNA adjacent to the adaptor comprising universal sequence C are sequenced (corresponding to ~20 to 50 bases of the target nucleic acid) and the barcode adjacent to the adaptor comprising universal sequence B is sequenced (see, e.g., FIG. 3). Once the sequences are obtained, the sequence reads are parsed into bins by the barcode sequences to collect sequence reads that originated from a template molecule tagged with that particular unique barcode sequence (see, e.g., FIG. 3). The sequence reads in each bin (for each barcode sequence) are aligned to each other and assembled to construct a longer contiguous consensus sequence with phase information intact. This sequence can be aligned to an appropriate reference sequence for downstream sequence analysis.
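The binning step described above amounts to grouping reads by their barcode before any per-template assembly. The Python sketch below shows one minimal way to do this, assuming each sequencing record has already been split into a barcode and a target-derived read; the record layout as `(barcode, read)` tuples and the function name are hypothetical, and barcode error correction is not attempted.

```python
from collections import defaultdict


def bin_reads_by_barcode(records):
    """Group target-derived reads by the barcode they were sequenced with.

    `records` is an iterable of (barcode, read) tuples. Returns a dict
    mapping each barcode to the list of its reads, in input order.
    """
    bins = defaultdict(list)
    for barcode, read in records:
        bins[barcode].append(read)
    return dict(bins)


if __name__ == "__main__":
    records = [
        ("ACGTA", "ATGCCT"),
        ("TTGCA", "GGATCC"),
        ("ACGTA", "GCCTGA"),
    ]
    for barcode, reads in bin_reads_by_barcode(records).items():
        print(barcode, reads)
```

Each resulting bin would then be aligned and assembled separately, so that reads from different template molecules are not mixed.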

Various exemplary nucleic acid sequencing platforms, nucleic acid assembly, and/or nucleic acid mapping systems (e.g., computer software and/or hardware) are described, e.g., in U.S. Pat. Appl. Pub. No. 2011/0270533, which is incorporated herein by reference. The techniques of “paired-end”, “mate-pair”, and other assembly-related sequencing are generally known in the art of molecular biology (Siegel A. F. et al., Genomics 2000, 68: 237-246; Roach J. C. et al., Genomics 1995, 26: 345-353). These sequencing techniques allow the determination of multiple “reads” of sequence, each from a different place on a single polynucleotide. Typically, the distance between the reads or other information regarding a relationship between the reads is known. In some situations, these sequencing techniques provide more information than does sequencing multiple stretches of nucleic acid sequences in a random fashion. With the use of appropriate software tools for the assembly of sequence information (e.g., Millikin S. C. et al., Genome Res. 2003, 13: 81-90; Kent, W. J. et al., Genome Res. 2001, 11: 1541-8) it is possible to make use of the knowledge that the sequences are not completely random, but are known to occur a known distance apart and/or to have some other relationship, and are therefore linked in the genome. This information can aid in the assembly of whole nucleic acid sequences into a consensus sequence.

2. Nucleotide Analogs

In some embodiments a nucleotide analog finds use as a functional nucleotide terminator (e.g., in embodiments of compositions, methods, kits, and systems described herein). A functional nucleotide terminator both terminates polymerization of a nucleic acid, e.g., by blocking the 3′ hydroxyl from participating further in the polymerization reaction, and comprises a functional reactive group that can participate in other chemical reactions with other chemical moieties and groups.

For example, a nucleotide analog comprising an alkynyl group finds use in some embodiments, e.g., having a structure according to:

wherein B is a base, e.g., adenine, guanine, cytosine, thymine, or uracil, e.g., having a structure according to:

or a modified base or analog of a base, and P comprises a phosphate moiety, e.g., to provide a nucleotide having a structure according to:

In some embodiments, P comprises a tetraphosphate; a triphosphate; a diphosphate; a monophosphate; a 5′ hydroxyl; an alpha thiophosphate (e.g., phosphorothioate or phosphorodithioate), a beta thiophosphate (e.g., phosphorothioate or phosphorodithioate), and/or a gamma thiophosphate (e.g., phosphorothioate or phosphorodithioate); or an alpha methylphosphonate, a beta methylphosphonate, and/or a gamma methylphosphonate. In some embodiments, P comprises an azide (e.g., N₃, e.g., N═N═N), thus providing, in some embodiments, a directional, bi-functional polymerization agent. In some embodiments, the technology comprises use of a nucleotide analog as described in co-pending U.S. patent application Ser. Nos. 14/463,412 and 14/463,416; and Int'l Pat. App. PCT/US2014/051726, each of which is incorporated herein by reference in its entirety.

In some embodiments, the nucleotide analog is a 3′-O-alkynyl nucleotide analog; in some embodiments the nucleotide analog is a 3′-O-propargyl nucleotide analog such as a 3′-O-propargyl dNTP (wherein N=A, C, G, T, or U). A propargyl nucleotide analog is a nucleotide analog comprising a base (e.g., adenine, guanine, cytosine, thymine, or uracil), a deoxyribose, and an alkyne chemical moiety attached to the 3′-oxygen of the deoxyribose. Chemical ligation between the polymerase extension products and appropriate conjugation partners (e.g., azide modified molecules) is achieved with high efficiency and specificity using, e.g., click chemistry.

The 3′ hydroxyl group of the nucleotide analog is capped by a chemical moiety, e.g., an alkyne (e.g., a carbon-carbon triple bond), that halts further elongation of the nucleic acid (e.g., DNA, RNA) chain when incorporated by polymerase (e.g., DNA or RNA polymerase). The alkyne chemical moiety is a well-known conjugation partner of an azide (N₃) group, e.g., in a copper (I)-catalyzed 1,3-dipolar cycloaddition reaction (e.g., a “click chemistry” reaction). Reaction of the alkyne with the azide forms a five-membered triazole ring, which thereby creates a covalent linkage. The triazole ring linkage, in certain positional arrangements, has characteristics that are similar to a natural phosphodiester bond as found in a conventional nucleic acid backbone and therefore the triazole link is a nucleic acid backbone mimic. As provided by some embodiments herein, use of 3′-O-propargyl-dNTPs creates nucleic acid fragments that have a terminal 3′-O-alkyne group. Accordingly, these nucleic acid fragments can then be chemically ligated using click chemistry to any azide-modified molecules, such as 5′-azide-modified oligonucleotides (e.g., adaptors as provided herein) or a solid support. The triazole chemical bond is compatible with typical reactions and enzymes used for biochemistry and molecular biology and, as such, does not inhibit enzymatic reactions. Accordingly, the chemically ligated nucleic acid fragments can then be used in subsequent enzymatic reactions, such as a polymerase chain reaction, a sequencing reaction, etc.

In some embodiments, the nucleotide analog comprises a reversible terminator. For example, in a nucleotide analog comprising a reversible terminator, the 3′ hydroxyl groups are capped with a chemical moiety that can be removed with a specific chemical reaction, thus regenerating a free 3′ hydroxyl. As such, some embodiments comprise a reaction to remove the reversible terminator and, in some embodiments, an additional purification step to remove the free capping (terminator) moiety. In some embodiments, a nucleotide comprising a reversible terminator is as described in U.S. Pat. App. Ser. No. 61/791,730 and/or in International Application Number PCT/US14/24391, each incorporated herein by reference in its entirety.

3. Adaptors

Methods of the technology involve attaching an adaptor to a nucleic acid (e.g., an amplicon or a ladder fragment as described herein). In certain embodiments, the adaptors are attached to a nucleic acid with an enzyme. The enzyme may be a ligase or a polymerase. The ligase may be any enzyme capable of ligating an oligonucleotide (single stranded RNA, double stranded RNA, single stranded DNA, or double stranded DNA) to another nucleic acid molecule. Suitable ligases include T4 DNA ligase and T4 RNA ligase (such ligases are available commercially, e.g., from New England BioLabs). Methods for using ligases are well known in the art. The ligation may be blunt ended or via use of complementary overhanging ends. In certain embodiments, the ends of nucleic acids may be phosphorylated (e.g., using T4 polynucleotide kinase), repaired, trimmed (e.g., using an exonuclease), or filled (e.g., using a polymerase and dNTPs), to form blunt ends. Upon generating blunt ends, the ends may be treated with a polymerase and dATP to form a template independent addition to the 3′ end of the fragments, thus producing a single A overhanging. This single A is used to guide ligation of fragments with a single T overhanging from the 5′ end in a method referred to as T-A cloning. The polymerase may be any enzyme capable of adding nucleotides to the 3′ and the 5′ terminus of template nucleic acid molecules.

In some embodiments an adaptor comprises a functional moiety for chemical ligation to a nucleotide analog. For example, in some embodiments an adaptor comprises an azide group (e.g., at the 5′ end) that is reactive with an alkynyl group (e.g., a propargyl group, e.g., at the 3′ end of a nucleic acid comprising the nucleotide analog), e.g., by a click chemistry reaction (e.g., using a copper-based catalyst reagent).

In some embodiments, the adaptors comprise a universal sequence and/or an index, e.g., a barcode nucleotide sequence. Additionally, adaptors can contain one or more of a variety of sequence elements, including but not limited to, one or more amplification primer annealing sequences or complements thereof, one or more sequencing primer annealing sequences or complements thereof, one or more barcode sequences, one or more common sequences shared among multiple different adaptors or subsets of different adaptors (e.g., a universal sequence), one or more restriction enzyme recognition sites, one or more overhangs complementary to one or more target polynucleotide overhangs, one or more probe binding sites (e.g., for attachment to a sequencing platform, such as a flow cell for massive parallel sequencing, such as developed by Illumina, Inc.), one or more random or near-random sequences (e.g., one or more nucleotides selected at random from a set of two or more different nucleotides at one or more positions, with each of the different nucleotides selected at one or more positions represented in a pool of adaptors comprising the random sequence), and combinations thereof. Two or more sequence elements can be non-adjacent to one another (e.g., separated by one or more nucleotides), adjacent to one another, partially overlapping, or completely overlapping. For example, an amplification primer annealing sequence can also serve as a sequencing primer annealing sequence. Sequence elements can be located at or near the 3′ end, at or near the 5′ end, or in the interior of the adaptor oligonucleotide. When an adaptor oligonucleotide is capable of forming secondary structure, such as a hairpin, sequence elements can be located partially or completely outside the secondary structure, partially or completely inside the secondary structure, or in between sequences participating in the secondary structure. 
For example, when an adaptor oligonucleotide comprises a hairpin structure, sequence elements can be located partially or completely inside or outside the hybridizable sequences (the “stem”), including in the sequence between the hybridizable sequences (the “loop”). In some embodiments, the first adaptor oligonucleotides in a plurality of first adaptor oligonucleotides having different barcode sequences comprise a sequence element common among all first adaptor oligonucleotides in the plurality. In some embodiments, all second adaptor oligonucleotides comprise a sequence element common among all second adaptor oligonucleotides that is different from the common sequence element shared by the first adaptor oligonucleotides. A difference in sequence elements can be any such that at least a portion of different adaptors do not completely align, for example, due to changes in sequence length, deletion or insertion of one or more nucleotides, or a change in the nucleotide composition at one or more nucleotide positions (such as a base change or base modification). In some embodiments, an adaptor oligonucleotide comprises a 5′ overhang, a 3′ overhang, or both that is complementary to one or more target polynucleotides. Complementary overhangs can be one or more nucleotides in length, including but not limited to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, or more nucleotides in length. Complementary overhangs may comprise a fixed sequence. Complementary overhangs may comprise a random sequence of one or more nucleotides, such that one or more nucleotides are selected at random from a set of two or more different nucleotides at one or more positions, with each of the different nucleotides selected at one or more positions represented in a pool of adaptors with complementary overhangs comprising the random sequence. In some embodiments, an adaptor overhang is complementary to a target polynucleotide overhang produced by restriction endonuclease digestion. 
In some embodiments, an adaptor overhang consists of an adenine or a thymine.

In some embodiments, the adaptor sequences can contain a molecular binding site identification element to facilitate identification and isolation of the target nucleic acid for downstream applications. Molecular binding as an affinity mechanism allows for the interaction between two molecules to result in a stable association complex. Molecules that can participate in molecular binding reactions include proteins, nucleic acids, carbohydrates, lipids, and small organic molecules such as ligands, peptides, or drugs.

When a nucleic acid molecular binding site is used as part of the adaptor, it can be used to employ selective hybridization to isolate a target sequence. Selective hybridization may restrict substantial hybridization to target nucleic acids containing the adaptor with the molecular binding site and capture nucleic acids, which are sufficiently complementary to the molecular binding site. Thus, through “selective hybridization” one can detect the presence of the target polynucleotide in an impure sample containing a pool of many nucleic acids. An example of a nucleotide-nucleotide selective hybridization isolation system comprises a system with several capture nucleotides, which are complementary sequences to the molecular binding identification elements, and are optionally immobilized to a solid support. In other embodiments, the capture polynucleotides could be complementary to the target sequence itself or a barcode or unique tag contained within the adaptor. The capture polynucleotides can be immobilized to various solid supports, such as inside of a well of a plate, mono-dispersed spheres, microarrays, or any other suitable support surface known in the art. The hybridized complementary adaptor polynucleotides attached on the solid support can be isolated by washing away the undesirable non-binding nucleic acids, leaving the desirable target polynucleotides behind. If complementary adaptor molecules are fixed to paramagnetic spheres or similar bead technology for isolation, the spheres can then be mixed in a tube together with the target polynucleotide containing the adaptors. When the adaptor sequences have been hybridized with the complementary sequences fixed to the spheres, undesirable molecules can be washed away while spheres are kept in the tube with a magnet or similar agent. The desired target molecules can be subsequently released by increasing the temperature, changing the pH, or by using any other suitable elution method known in the art.

4. Barcodes

A barcode is a known nucleic acid sequence that allows some feature of a nucleic acid with which the barcode is associated to be identified. In some embodiments, the feature of the nucleic acid to be identified is the sample or source from which the nucleic acid is derived. The barcode sequence generally includes certain features that make the sequence useful in sequencing reactions. For example, the barcode sequences are designed to have minimal or no homopolymer regions, e.g., 2 or more of the same base in a row such as AA or CCC, within the barcode sequence. In some embodiments, the barcode sequences are also designed so that they are at least one edit distance away from the base addition order when performing base-by-base sequencing, ensuring that the first and last bases do not match the expected bases of the sequence.
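As a non-limiting illustration of the homopolymer criterion, candidate barcode sequences can be screened for runs of identical bases. The following is a minimal sketch; the function name `has_homopolymer`, the `min_run` parameter, and the candidate sequences are assumptions made for the example and do not represent a required design procedure.

```python
import re


def has_homopolymer(barcode, min_run=2):
    """Return True if the barcode contains min_run or more identical bases in a row."""
    # (.) captures one base; \1{n,} requires at least n further copies of it.
    pattern = r"(.)\1{%d,}" % (min_run - 1)
    return re.search(pattern, barcode) is not None


# Hypothetical candidate barcodes; keep only those without any repeated base.
candidates = ["ACGTAC", "AACGTC", "TGCATG", "GTCCCA"]
accepted = [b for b in candidates if not has_homopolymer(b)]
print(accepted)
```

Raising `min_run` relaxes the filter, e.g., to tolerate AA while still rejecting CCC.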

In some embodiments, the barcode sequences are designed such that each sequence is correlated to a particular target nucleic acid, allowing the short sequence reads to be correlated back to the target nucleic acid from which they came. Methods of designing sets of barcode sequences are shown, for example, in U.S. Pat. No. 6,235,475, the contents of which are incorporated by reference herein in their entirety. In some embodiments, the barcode sequences range from about 5 nucleotides to about 15 nucleotides. In a particular embodiment, the barcode sequences range from about 4 nucleotides to about 7 nucleotides. Since the barcode sequences are sequenced along with the ladder fragment nucleic acid, in embodiments using longer sequences the barcode length is of a minimal length so as to permit the longest read from the fragment nucleic acid attached to the barcode. In some embodiments, the barcode sequences are spaced from the fragment nucleic acid molecule by at least one base, e.g., to minimize homopolymeric combinations.

In some embodiments, lengths and sequences of barcode sequences are designed to achieve a desired level of accuracy of determining the identity of nucleic acid. For example, in some embodiments barcode sequences are designed such that after a tolerable number of point mutations, the identity of the associated nucleic acid can still be deduced with a desired accuracy. In some embodiments, a Tn-5 transposase (commercially available from Epicentre Biotechnologies; Madison, Wis.) cuts a nucleic acid into fragments and inserts short pieces of DNA into the cuts. The short pieces of DNA are used to incorporate the barcode sequences.
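One way to illustrate tolerance to point mutations is with Hamming distance: if every pair of barcodes in a set differs at three or more positions, an observed barcode carrying a single substitution is still closest to exactly one member of the set. The sketch below assumes substitution errors only (insertions and deletions are not handled); the barcode set, the function names, and the `max_mismatches` threshold are hypothetical choices for the example.

```python
def hamming(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes):
    """Smallest Hamming distance between any two barcodes in the set."""
    return min(
        hamming(a, b)
        for i, a in enumerate(barcodes)
        for b in barcodes[i + 1:]
    )


def assign_barcode(observed, barcodes, max_mismatches=1):
    """Return the single known barcode within max_mismatches, else None."""
    hits = [b for b in barcodes if hamming(observed, b) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None


# Hypothetical barcode set in which every pair differs at all six positions.
known = ["ACGTAC", "TGCATG", "GATCGA"]
print(min_pairwise_distance(known))
print(assign_barcode("ACGTAA", known))  # one substitution relative to ACGTAC
print(assign_barcode("AAAAAA", known))  # too far from every known barcode
```

In this sketch an observed barcode that matches no known barcode, or more than one, is left unassigned rather than guessed.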

Attaching adaptors comprising barcodes to nucleic acid templates is shown in U.S. Pat. Appl. Pub. No. 2008/0081330 and in International Pat. Appl. No. PCT/US09/64001, the content of each of which is incorporated by reference herein in its entirety. Methods for designing sets of barcode sequences and other methods for attaching adaptors (e.g., comprising barcode sequences) are shown in U.S. Pat. Nos. 6,138,077; 6,352,828; 5,636,400; 6,172,214; 6,235,475; 7,393,665; 7,544,473; 5,846,719; 5,695,934; 5,604,097; 6,150,516; RE39,793; 7,537,897; 6,172,218; and 5,863,722, the content of each of which is incorporated by reference herein in its entirety. In certain embodiments, a single barcode is attached to each fragment. In other embodiments, a plurality of barcodes, e.g., two barcodes, is attached to each fragment.

5. Samples

In some embodiments, nucleic acid template molecules (e.g., DNA or RNA) are isolated from a biological sample containing a variety of other components, such as proteins, lipids, and non-template nucleic acids. Nucleic acid template molecules can be obtained from any material (e.g., cellular material (live or dead), extracellular material, viral material, environmental samples (e.g., metagenomic samples), synthetic material (e.g., amplicons such as provided by PCR or other amplification technologies)), obtained from an animal, plant, bacterium, archaeon, fungus, or any other organism. Biological samples for use in the present invention include viral particles or preparations thereof. Nucleic acid template molecules can be obtained directly from an organism or from a biological sample obtained from an organism, e.g., from blood, urine, cerebrospinal fluid, seminal fluid, saliva, sputum, stool, hair, sweat, tears, skin, and tissue. Exemplary samples include, but are not limited to, whole blood, lymphatic fluid, serum, plasma, buccal cells, sweat, tears, saliva, sputum, hair, skin, biopsy, cerebrospinal fluid (CSF), amniotic fluid, seminal fluid, vaginal excretions, serous fluid, synovial fluid, pericardial fluid, peritoneal fluid, pleural fluid, transudates, exudates, cystic fluid, bile, urine, gastric fluids, intestinal fluids, fecal samples, and swabs, aspirates (e.g., bone marrow, fine needle, etc.), washes (e.g., oral, nasopharyngeal, bronchial, bronchoalveolar, optic, rectal, intestinal, vaginal, epidermal, etc.), and/or other specimens.

Any tissue or body fluid specimen may be used as a source for nucleic acid for use in the technology, including forensic specimens, archived specimens, preserved specimens, and/or specimens stored for long periods of time, e.g., fresh-frozen, methanol/acetic acid fixed, or formalin-fixed paraffin embedded (FFPE) specimens and samples. Nucleic acid template molecules can also be isolated from cultured cells, such as a primary cell culture or a cell line. The cells or tissues from which template nucleic acids are obtained can be infected with a virus or other intracellular pathogen. A sample can also be total RNA extracted from a biological specimen, a cDNA library, viral, or genomic DNA. A sample may also be isolated DNA from a non-cellular origin, e.g., amplified/isolated DNA that has been stored in a freezer.

Nucleic acid template molecules can be obtained, e.g., by extraction from a biological sample, e.g., by a variety of techniques such as those described by Maniatis, et al. (1982) Molecular Cloning: A Laboratory Manual, Cold Spring Harbor, N.Y. (see, e.g., pp. 280-281).

In some embodiments, size selection of the nucleic acids is performed to remove very short fragments or very long fragments. Suitable methods to select a size are known in the art. In various embodiments, the size is limited to be 0.5, 1, 2, 3, 4, 5, 7, 10, 12, 15, 20, 25, 30, 50, 100 kb or longer.

In various embodiments, a nucleic acid is amplified. Any amplification method known in the art may be used. Examples of amplification techniques that can be used include, but are not limited to, PCR, quantitative PCR, quantitative fluorescent PCR (QF-PCR), multiplex fluorescent PCR (MF-PCR), real time PCR (RT-PCR), single cell PCR, restriction fragment length polymorphism PCR (PCR-RFLP), hot start PCR, nested PCR, in situ polony PCR, in situ rolling circle amplification (RCA), bridge PCR, picotiter PCR, and emulsion PCR. Other suitable amplification methods include the ligase chain reaction (LCR), transcription amplification, self-sustained sequence replication, selective amplification of target polynucleotide sequences, consensus sequence primed polymerase chain reaction (CP-PCR), arbitrarily primed polymerase chain reaction (AP-PCR), degenerate oligonucleotide-primed PCR (DOP-PCR), and nucleic acid based sequence amplification (NABSA). Other amplification methods that can be used herein include those described in U.S. Pat. Nos. 5,242,794; 5,494,810; 4,988,617; and 6,582,938.

In some embodiments, end repair is performed to generate blunt end 5′ phosphorylated nucleic acid ends using commercial kits, such as those available from Epicentre Biotechnologies (Madison, Wis.).

6. Nucleic Acid Sequencing

In some embodiments of the technology, nucleic acid sequence data are generated. Various embodiments of nucleic acid sequencing platforms (e.g., a nucleic acid sequencer) include components as described below. According to various embodiments, a sequencing instrument includes a fluidic delivery and control unit, a sample processing unit, a signal detection unit, and a data acquisition, analysis and control unit. Various embodiments of the instrument provide for automated sequencing that is used to gather sequence information from a plurality of sequences in parallel and/or substantially simultaneously.

In some embodiments, the fluidics delivery and control unit includes a reagent delivery system. The reagent delivery system includes a reagent reservoir for the storage of various reagents. The reagents can include RNA-based primers, forward/reverse DNA primers, nucleotide mixtures (e.g., compositions comprising nucleotide analogs as provided herein) for sequencing-by-synthesis, buffers, wash reagents, blocking reagents, stripping reagents, and the like. Additionally, the reagent delivery system can include a pipetting system or a continuous flow system that connects the sample processing unit with the reagent reservoir.

In some embodiments, the sample processing unit includes a sample chamber, such as a flow cell, a substrate, a micro-array, a multi-well tray, or the like. The sample processing unit can include multiple lanes, multiple channels, multiple wells, or other means of processing multiple sample sets substantially simultaneously. Additionally, the sample processing unit can include multiple sample chambers to enable processing of multiple runs simultaneously. In particular embodiments, the system can perform signal detection on one sample chamber while substantially simultaneously processing another sample chamber. Additionally, the sample processing unit can include an automation system for moving or manipulating the sample chamber. In some embodiments, the signal detection unit can include an imaging or detection sensor. For example, the imaging or detection sensor (e.g., a fluorescence detector or an electrical detector) can include a CCD, a CMOS, an ion sensor, such as an ion sensitive layer overlying a CMOS, a current detector, or the like. The signal detection unit can include an excitation system to cause a probe, such as a fluorescent dye, to emit a signal. The detection system can include an illumination source, such as an arc lamp, a laser, a light emitting diode (LED), or the like. In particular embodiments, the signal detection unit includes optics for the transmission of light from an illumination source to the sample or from the sample to the imaging or detection sensor. Alternatively, the signal detection unit may not include an illumination source, such as for example, when a signal is produced spontaneously as a result of a sequencing reaction. For example, a signal can be produced by the interaction of a released moiety, such as a released ion interacting with an ion sensitive layer, or a pyrophosphate reacting with an enzyme or other catalyst to produce a chemiluminescent signal. 
In another example, changes in an electrical current, voltage, or resistance are detected without the need for an illumination source.

In some embodiments, a data acquisition analysis and control unit monitors various system parameters. The system parameters can include temperature of various portions of the instrument, such as sample processing unit or reagent reservoirs, volumes of various reagents, the status of various system subcomponents, such as a manipulator, a stepper motor, a pump, or the like, or any combination thereof.

It will be appreciated by one skilled in the art that various embodiments of the instruments and systems are used to practice sequencing methods such as sequencing by synthesis, single molecule methods, and other sequencing techniques. Sequencing by synthesis can include the incorporation of dye labeled nucleotides, chain termination, ion/proton sequencing, pyrophosphate sequencing, or the like. Single molecule techniques can include staggered sequencing, where the sequencing reaction is paused to determine the identity of the incorporated nucleotide.

In some embodiments, the sequencing instrument determines the sequence of a nucleic acid, such as a polynucleotide or an oligonucleotide. The nucleic acid can include DNA or RNA, and can be single stranded, such as ssDNA and RNA, or double stranded, such as dsDNA or a RNA/cDNA pair. In some embodiments, the nucleic acid can include or be derived from a fragment library, a mate pair library, a ChIP fragment, or the like. In particular embodiments, the sequencing instrument can obtain the sequence information from a single nucleic acid molecule or from a group of substantially identical nucleic acid molecules.

In some embodiments, the sequencing instrument can output nucleic acid sequencing read data in a variety of different output data file types/formats, including, but not limited to: *.txt, *.fasta, *.csfasta, *seq.txt, *qseq.txt, *.fastq, *.sff, *prb.txt, *.sms, *srs, and/or *.qv.
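As a non-limiting illustration of handling one of the listed formats, a *.fastq file can be read record by record. The sketch below assumes the common four-line-per-record layout (header beginning with “@”, sequence, separator beginning with “+”, quality string) and a Phred+33 quality encoding; both are assumptions about the input, and the function names and toy records are hypothetical.

```python
import io


def parse_fastq(handle):
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # separator line beginning with '+', ignored here
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError("malformed FASTQ header: %r" % header)
        yield header[1:], sequence, quality


def phred_scores(quality):
    """Convert a quality string to integer scores, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality]


# Toy in-memory FASTQ content (assumed), standing in for an instrument output file.
example = io.StringIO("@read1\nACGTAC\n+\nIIIIII\n@read2\nTGCATG\n+\nIIHHGG\n")
records = list(parse_fastq(example))
for read_id, sequence, quality in records:
    print(read_id, sequence, min(phred_scores(quality)))
```

For real data a maintained parser is generally preferable, since multi-line records and other encodings exist.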

7. Next-Generation Sequencing Technologies

Particular sequencing technologies contemplated by the technology are next-generation sequencing (NGS) methods that share the common feature of massively parallel, high-throughput strategies, with the goal of lower costs in comparison to older sequencing methods (see, e.g., Voelkerding et al., Clinical Chem., 55: 641-658, 2009; MacLean et al., Nature Rev. Microbiol., 7: 287-296; each herein incorporated by reference in their entirety). NGS methods can be broadly divided into those that typically use template amplification and those that do not. Amplification-requiring methods include pyrosequencing commercialized by Roche as the 454 technology platforms (e.g., GS 20 and GS FLX), the Solexa platform commercialized by Illumina, and the Supported Oligonucleotide Ligation and Detection (SOLiD) platform commercialized by Applied Biosystems. Non-amplification approaches, also known as single-molecule sequencing, are exemplified by the HeliScope platform commercialized by Helicos BioSciences, and emerging platforms commercialized by VisiGen, Oxford Nanopore Technologies Ltd., Life Technologies/Ion Torrent, and Pacific Biosciences, respectively.

In pyrosequencing (Voelkerding et al., Clinical Chem., 55: 641-658, 2009; MacLean et al., Nature Rev. Microbiol., 7: 287-296; U.S. Pat. Nos. 6,210,891; 6,258,568; each herein incorporated by reference in its entirety), the NGS fragment library is clonally amplified in-situ by capturing single template molecules with beads bearing oligonucleotides complementary to the adaptors. Each bead bearing a single template type is compartmentalized into a water-in-oil microvesicle, and the template is clonally amplified using a technique referred to as emulsion PCR. The emulsion is disrupted after amplification and beads are deposited into individual wells of a picotitre plate functioning as a flow cell during the sequencing reactions. Ordered, iterative introduction of each of the four dNTP reagents occurs in the flow cell in the presence of sequencing enzymes and luminescent reporter such as luciferase. In the event that an appropriate dNTP is added to the 3′ end of the sequencing primer, the resulting production of ATP causes a burst of luminescence within the well, which is recorded using a CCD camera. It is possible to achieve read lengths greater than or equal to 400 bases, and 10⁶ sequence reads can be achieved, resulting in up to 500 million base pairs (Mb) of sequence.

In the Solexa/Illumina platform (Voelkerding et al., Clinical Chem., 55: 641-658, 2009; MacLean et al., Nature Rev. Microbiol., 7: 287-296; U.S. Pat. Nos. 6,833,246; 7,115,400; 6,969,488; each herein incorporated by reference in its entirety), sequencing data are produced in the form of shorter-length reads. In this method, the fragments of the NGS fragment library are captured on the surface of a flow cell that is studded with oligonucleotide anchors. The anchor is used as a PCR primer, but because of the length of the template and its proximity to other nearby anchor oligonucleotides, extension by PCR results in the “arching over” of the molecule to hybridize with an adjacent anchor oligonucleotide to form a bridge structure on the surface of the flow cell. These loops of DNA are denatured and cleaved. Forward strands are then sequenced with reversible dye terminators. The sequence of incorporated nucleotides is determined by detection of post-incorporation fluorescence, with each fluor and block removed prior to the next cycle of dNTP addition. Sequence read length ranges from 36 nucleotides to over 100 nucleotides, with overall output exceeding 1 billion nucleotide pairs per analytical run.

Sequencing nucleic acid molecules using SOLiD technology (Voelkerding et al., Clinical Chem., 55: 641-658, 2009; MacLean et al., Nature Rev. Microbiol., 7: 287-296; U.S. Pat. Nos. 5,912,148; 6,130,073; each herein incorporated by reference in their entirety) also involves clonal amplification of the NGS fragment library by emulsion PCR. Following this, beads bearing template are immobilized on a derivatized surface of a glass flow-cell, and a primer complementary to the adaptor oligonucleotide is annealed. However, rather than utilizing this primer for 3′ extension, it is instead used to provide a 5′ phosphate group for ligation to interrogation probes containing two probe-specific bases followed by 6 degenerate bases and one of four fluorescent labels. In the SOLiD system, interrogation probes have 16 possible combinations of the two bases at the 3′ end of each probe, and one of four fluors at the 5′ end. Fluor color, and thus identity of each probe, corresponds to specified color-space coding schemes. Multiple rounds (usually 7) of probe annealing, ligation, and fluor detection are followed by denaturation, and then a second round of sequencing using a primer that is offset by one base relative to the initial primer. In this manner, the template sequence can be computationally re-constructed, and template bases are interrogated twice, resulting in increased accuracy. Sequence read length averages 35 nucleotides, and overall output exceeds 4 billion bases per sequencing run.
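To illustrate how 16 dinucleotide combinations can be represented by four colors and how a base sequence can be computationally reconstructed from colors, the following sketch uses a two-base encoding in which each base is assigned a two-bit code and the color of a dinucleotide is the exclusive-or of its two codes. This particular mapping (A=0, C=1, G=2, T=3) is an assumption adopted for the example, as are the function names; the description above specifies only that color corresponds to a specified color-space coding scheme.

```python
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"


def encode_colors(sequence):
    """Translate a base sequence into colors, one color per adjacent base pair."""
    codes = [BASE_TO_CODE[base] for base in sequence]
    return [left ^ right for left, right in zip(codes, codes[1:])]


def decode_colors(first_base, colors):
    """Rebuild the base sequence from a known first base and the color calls."""
    code = BASE_TO_CODE[first_base]
    bases = [first_base]
    for color in colors:
        code ^= color  # the next base is fixed by the previous base and the color
        bases.append(CODE_TO_BASE[code])
    return "".join(bases)


colors = encode_colors("ATGGC")
print(colors)
print(decode_colors("A", colors))

# Each of the four colors is shared by four of the 16 possible dinucleotides.
dinucleotides = [x + y for x in "ACGT" for y in "ACGT"]
by_color = {}
for pair in dinucleotides:
    by_color.setdefault(encode_colors(pair)[0], []).append(pair)
print({color: len(pairs) for color, pairs in sorted(by_color.items())})
```

Because each color depends on two adjacent bases, every template base contributes to two color calls, which is consistent with each base being interrogated twice.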

In certain embodiments, HeliScope by Helicos BioSciences is employed (Voelkerding et al., Clinical Chem., 55: 641-658, 2009; MacLean et al., Nature Rev. Microbiol., 7: 287-296; U.S. Pat. Nos. 7,169,560; 7,282,337; 7,482,120; 7,501,245; 6,818,395; 6,911,345; 7,501,245; each herein incorporated by reference in their entirety). Sequencing is achieved by addition of polymerase and serial addition of fluorescently-labeled dNTP reagents. Incorporation events result in a fluor signal corresponding to the dNTP, and signal is captured by a CCD camera before each round of dNTP addition. Sequence read length ranges from 25-50 nucleotides, with overall output exceeding 1 billion nucleotide pairs per analytical run.

In some embodiments, 454 sequencing by Roche is used (Margulies et al. (2005) Nature 437: 376-380). 454 sequencing involves two steps. In the first step, DNA is sheared into fragments of approximately 300-800 base pairs and the fragments are blunt ended. Oligonucleotide adaptors are then ligated to the ends of the fragments. The adaptors serve as primers for amplification and sequencing of the fragments. The fragments can be attached to DNA capture beads, e.g., streptavidin-coated beads using, e.g., an adaptor that contains a 5′-biotin tag. The fragments attached to the beads are PCR amplified within droplets of an oil-water emulsion. The result is multiple copies of clonally amplified DNA fragments on each bead. In the second step, the beads are captured in wells (pico-liter sized). Pyrosequencing is performed on each DNA fragment in parallel. Addition of one or more nucleotides generates a light signal that is recorded by a CCD camera in a sequencing instrument. The signal strength is proportional to the number of nucleotides incorporated. Pyrosequencing makes use of pyrophosphate (PPi) which is released upon nucleotide addition. PPi is converted to ATP by ATP sulfurylase in the presence of adenosine 5′ phosphosulfate. Luciferase uses ATP to convert luciferin to oxyluciferin, and this reaction generates light that is detected and analyzed.
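The relationship between ordered nucleotide flows and a signal proportional to the number of nucleotides incorporated can be illustrated with an idealized, noise-free model. In the sketch below, the flow order “TACG”, the function names, and the integer signal values are assumptions made for the example; actual instruments record analog light intensities that require normalization and base calling.

```python
def simulate_flows(synthesized, flow_order="TACG", max_flows=100):
    """Return a list of (flowed base, number incorporated) for an idealized run.

    synthesized: the strand being built, 5' to 3' (hypothetical input).
    A homopolymer run is incorporated in full within a single flow.
    """
    signals = []
    position = 0
    flow = 0
    while position < len(synthesized) and flow < max_flows:
        base = flow_order[flow % len(flow_order)]
        run = 0
        while position < len(synthesized) and synthesized[position] == base:
            run += 1
            position += 1
        signals.append((base, run))
        flow += 1
    return signals


def call_bases(signals):
    """Rebuild the synthesized strand from idealized flow signals."""
    return "".join(base * run for base, run in signals)


signals = simulate_flows("TTAGC")
print(signals)
print(call_bases(signals))
```

A flow with signal 0 indicates that the flowed nucleotide was not the next base, and a signal of 2 or more indicates a homopolymer run.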

The Ion Torrent technology is a method of DNA sequencing based on the detection of hydrogen ions that are released during the polymerization of DNA (see, e.g., Science 327(5970): 1190 (2010); U.S. Pat. Appl. Pub. Nos. 20090026082, 20090127589, 20100301398, 20100197507, 20100188073, and 20100137143, incorporated by reference in their entireties for all purposes). A microwell contains a fragment of the NGS fragment library to be sequenced. Beneath the layer of microwells is a hypersensitive ISFET ion sensor. All layers are contained within a CMOS semiconductor chip, similar to that used in the electronics industry. When a dNTP is incorporated into the growing complementary strand, a hydrogen ion is released, which triggers the hypersensitive ion sensor. If homopolymer repeats are present in the template sequence, multiple dNTP molecules will be incorporated in a single cycle. This leads to a corresponding number of released hydrogen ions and a proportionally higher electronic signal. This technology differs from other sequencing technologies in that no modified nucleotides or optics are used. The per-base accuracy of the Ion Torrent sequencer is ~99.6% for 50 base reads, with ~400 Mb generated per run. The read-length is 100 base pairs. The accuracy for homopolymer repeats of 5 repeats in length is ~98%. The benefits of ion semiconductor sequencing are rapid sequencing speed and low upfront and operating costs. However, the cost of acquiring a pH-mediated sequencer is approximately $50,000, excluding sample preparation equipment and a server for data analysis.
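By way of non-limiting illustration, the proportional-signal principle described above for pyrosequencing and ion semiconductor sequencing can be sketched as a toy flow decoder. The flow order, the signal values, and the function name `decode_flows` are hypothetical and chosen only for illustration; actual instruments apply considerably more elaborate signal processing.

```python
# Toy flow-space decoder. The flow order and signal values are hypothetical
# illustration data, not instrument output.

def decode_flows(flow_order: str, signals: list[float]) -> str:
    """Turn per-flow signal intensities into a base sequence.

    Each flow offers one nucleotide. A signal near n is read as n
    incorporations (a homopolymer run of length n); a signal near 0
    is read as no incorporation.
    """
    bases = []
    for i, signal in enumerate(signals):
        nucleotide = flow_order[i % len(flow_order)]
        run_length = max(0, int(signal + 0.5))  # round to the nearest whole count
        bases.append(nucleotide * run_length)
    return "".join(bases)


flow_order = "TACG"  # assumed cyclic nucleotide flow order
signals = [1.02, 0.05, 2.10, 0.96, 0.03, 2.95, 0.10, 1.04]
print(decode_flows(flow_order, signals))  # TCCGAAAG
```

In this sketch a signal of about 2 in the C flow is read as "CC" and a signal of about 3 in the A flow as "AAA", which is why accuracy tends to fall as homopolymer length grows and adjacent integer levels become harder to separate.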

Another exemplary nucleic acid sequencing approach that may be adapted for use with the present invention was developed by Stratos Genomics, Inc. and involves the use of Xpandomers. This sequencing process typically includes providing a daughter strand produced by a template-directed synthesis. The daughter strand generally includes a plurality of subunits coupled in a sequence corresponding to a contiguous nucleotide sequence of all or a portion of a target nucleic acid in which the individual subunits comprise a tether, at least one probe or nucleobase residue, and at least one selectively cleavable bond. The selectively cleavable bond(s) is/are cleaved to yield an Xpandomer of a length longer than the plurality of the subunits of the daughter strand. The Xpandomer typically includes the tethers and reporter elements for parsing genetic information in a sequence corresponding to the contiguous nucleotide sequence of all or a portion of the target nucleic acid. Reporter elements of the Xpandomer are then detected. Additional details relating to Xpandomer-based approaches are described in, for example, U.S. Pat. Pub. No. 20090035777, entitled “HIGH THROUGHPUT NUCLEIC ACID SEQUENCING BY EXPANSION,” filed Jun. 19, 2008, which is incorporated herein in its entirety.

Other single molecule sequencing methods include real-time sequencing by synthesis using a VisiGen platform (Voelkerding et al., Clinical Chem., 55: 641-58, 2009; U.S. Pat. No. 7,329,492; U.S. patent application Ser. No. 11/671,956; U.S. patent application Ser. No. 11/781,166; each herein incorporated by reference in their entirety) in which fragments of the NGS fragment library are immobilized, primed, then subjected to strand extension using a fluorescently-modified polymerase and fluorescent acceptor molecules, resulting in detectable fluorescence resonance energy transfer (FRET) upon nucleotide addition.

Another real-time single molecule sequencing system developed by Pacific Biosciences (Voelkerding et al., Clinical Chem., 55: 641-658, 2009; MacLean et al., Nature Rev. Microbiol., 7: 287-296; U.S. Pat. Nos. 7,170,050; 7,302,146; 7,313,308; 7,476,503; all of which are herein incorporated by reference) utilizes reaction wells 50-100 nm in diameter and encompassing a reaction volume of approximately 20 zeptoliters (10⁻²¹ l). Sequencing reactions are performed using immobilized template, modified phi29 DNA polymerase, and high local concentrations of fluorescently labeled dNTPs. High local concentrations and continuous reaction conditions allow incorporation events to be captured in real time by fluor signal detection using laser excitation, an optical waveguide, and a CCD camera.

In certain embodiments, the single molecule real time (SMRT) DNA sequencing methods using zero-mode waveguides (ZMWs) developed by Pacific Biosciences, or similar methods, are employed. With this technology, DNA sequencing is performed on SMRT chips, each containing thousands of zero-mode waveguides (ZMWs). A ZMW is a hole, tens of nanometers in diameter, fabricated in a 100 nm metal film deposited on a silicon dioxide substrate. Each ZMW becomes a nanophotonic visualization chamber providing a detection volume of just 20 zeptoliters (10⁻²¹ l). At this volume, the activity of a single molecule can be detected amongst a background of thousands of labeled nucleotides. The ZMW provides a window for watching DNA polymerase as it performs sequencing by synthesis. Within each chamber, a single DNA polymerase molecule is attached to the bottom surface such that it permanently resides within the detection volume. Phospholinked nucleotides, each type labeled with a different colored fluorophore, are then introduced into the reaction solution at high concentrations which promote enzyme speed, accuracy, and processivity. Due to the small size of the ZMW, even at these high, biologically relevant concentrations, the detection volume is occupied by nucleotides only a small fraction of the time. In addition, visits to the detection volume are fast, lasting only a few microseconds, due to the very small distance that diffusion has to carry the nucleotides. The result is a very low background.

In some embodiments, nanopore sequencing is used (Soni G V and Meller A. (2007) Clin Chem 53: 1996-2001). A nanopore is a small hole, of the order of 1 nanometer in diameter. Immersion of a nanopore in a conducting fluid and application of a potential across it results in a slight electrical current due to conduction of ions through the nanopore. The amount of current which flows is sensitive to the size of the nanopore. As a DNA molecule passes through a nanopore, each nucleotide on the DNA molecule obstructs the nanopore to a different degree. Thus, the change in the current passing through the nanopore as the DNA molecule passes through the nanopore represents a reading of the DNA sequence.

In some embodiments, a sequencing technique uses a chemical-sensitive field effect transistor (chemFET) array to sequence DNA (for example, as described in US Patent Application Publication No. 20090026082). In one example of the technique, DNA molecules are placed into reaction chambers, and the template molecules are hybridized to a sequencing primer bound to a polymerase. Incorporation of one or more triphosphates into a new nucleic acid strand at the 3′ end of the sequencing primer can be detected by a change in current by a chemFET. An array can have multiple chemFET sensors. In another example, single nucleic acids can be attached to beads, and the nucleic acids can be amplified on the bead, and the individual beads can be transferred to individual reaction chambers on a chemFET array, with each chamber having a chemFET sensor, and the nucleic acids can be sequenced.

In some embodiments, a sequencing technique uses an electron microscope (Moudrianakis E. N. and Beer M. Proc Natl Acad Sci USA. 1965 March; 53:564-71). In one example of the technique, individual DNA molecules are labeled using metallic labels that are distinguishable using an electron microscope. These molecules are then stretched on a flat surface and imaged using an electron microscope to measure sequences.

In some embodiments, “four-color sequencing by synthesis using cleavable fluorescent nucleotide reversible terminators” as described in Turro, et al. PNAS 103: 19635-40 (2006) is used, e.g., as commercialized by Intelligent Bio-Systems. The technology is described in U.S. Pat. Appl. Pub. Nos. 2010/0323350, 2010/0063743, 2010/0159531, 20100035253, and 20100152050, each incorporated herein by reference for all purposes.

Processes and systems for such real time sequencing that may be adapted for use with the invention are described in, for example, U.S. Pat. No. 7,405,281, entitled “Fluorescent nucleotide analogs and uses therefor”, issued Jul. 29, 2008 to Xu et al.; U.S. Pat. No. 7,315,019, entitled “Arrays of optical confinements and uses thereof”, issued Jan. 1, 2008 to Turner et al.; U.S. Pat. No. 7,313,308, entitled “Optical analysis of molecules”, issued Dec. 25, 2007 to Turner et al.; U.S. Pat. No. 7,302,146, entitled “Apparatus and method for analysis of molecules”, issued Nov. 27, 2007 to Turner et al.; and U.S. Pat. No. 7,170,050, entitled “Apparatus and methods for optical analysis of molecules”, issued Jan. 30, 2007 to Turner et al.; and U.S. Pat. Pub. Nos. 20080212960, entitled “Methods and systems for simultaneous real-time monitoring of optical signals from multiple sources”, filed Oct. 26, 2007 by Lundquist et al.; 20080206764, entitled “Flowcell system for single molecule detection”, filed Oct. 26, 2007 by Williams et al.; 20080199932, entitled “Active surface coupled polymerases”, filed Oct. 26, 2007 by Hanzel et al.; 20080199874, entitled “CONTROLLABLE STRAND SCISSION OF MINI CIRCLE DNA”, filed Feb. 11, 2008 by Otto et al.; 20080176769, entitled “Articles having localized molecules disposed thereon and methods of producing same”, filed Oct. 26, 2007 by Rank et al.; 20080176316, entitled “Mitigation of photodamage in analytical reactions”, filed Oct. 31, 2007 by Eid et al.; 20080176241, entitled “Mitigation of photodamage in analytical reactions”, filed Oct. 31, 2007 by Eid et al.; 20080165346, entitled “Methods and systems for simultaneous real-time monitoring of optical signals from multiple sources”, filed Oct. 26, 2007 by Lundquist et al.; 20080160531, entitled “Uniform surfaces for hybrid material substrates and methods for making and using same”, filed Oct. 31, 2007 by Korlach; 20080157005, entitled “Methods and systems for simultaneous real-time monitoring of optical signals from multiple sources”, filed Oct. 26, 2007 by Lundquist et al.; 20080153100, entitled “Articles having localized molecules disposed thereon and methods of producing same”, filed Oct. 31, 2007 by Rank et al.; 20080153095, entitled “CHARGE SWITCH NUCLEOTIDES”, filed Oct. 26, 2007 by Williams et al.; 20080152281, entitled “Substrates, systems and methods for analyzing materials”, filed Oct. 31, 2007 by Lundquist et al.; 20080152280, entitled “Substrates, systems and methods for analyzing materials”, filed Oct. 31, 2007 by Lundquist et al.; 20080145278, entitled “Uniform surfaces for hybrid material substrates and methods for making and using same”, filed Oct. 31, 2007 by Korlach; 20080128627, entitled “SUBSTRATES, SYSTEMS AND METHODS FOR ANALYZING MATERIALS”, filed Aug. 31, 2007 by Lundquist et al.; 20080108082, entitled “Polymerase enzymes and reagents for enhanced nucleic acid sequencing”, filed Oct. 22, 2007 by Rank et al.; 20080095488, entitled “SUBSTRATES FOR PERFORMING ANALYTICAL REACTIONS”, filed Jun. 11, 2007 by Foquet et al.; 20080080059, entitled “MODULAR OPTICAL COMPONENTS AND SYSTEMS INCORPORATING SAME”, filed Sep. 27, 2007 by Dixon et al.; 20080050747, entitled “Articles having localized molecules disposed thereon and methods of producing and using same”, filed Aug. 14, 2007 by Korlach et al.; 20080032301, entitled “Articles having localized molecules disposed thereon and methods of producing same”, filed Mar. 29, 2007 by Rank et al.; 20080030628, entitled “Methods and systems for simultaneous real-time monitoring of optical signals from multiple sources”, filed Feb. 9, 2007 by Lundquist et al.; 20080009007, entitled “CONTROLLED INITIATION OF PRIMER EXTENSION”, filed Jun. 15, 2007 by Lyle et al.; 20070238679, entitled “Articles having localized molecules disposed thereon and methods of producing same”, filed Mar. 30, 2006 by Rank et al.; 20070231804, entitled “Methods, systems and compositions for monitoring enzyme activity and applications thereof”, filed Mar. 31, 2006 by Korlach et al.; 20070206187, entitled “Methods and systems for simultaneous real-time monitoring of optical signals from multiple sources”, filed Feb. 9, 2007 by Lundquist et al.; 20070196846, entitled “Polymerases for nucleotide analog incorporation”, filed Dec. 21, 2006 by Hanzel et al.; 20070188750, entitled “Methods and systems for simultaneous real-time monitoring of optical signals from multiple sources”, filed Jul. 7, 2006 by Lundquist et al.; 20070161017, entitled “MITIGATION OF PHOTODAMAGE IN ANALYTICAL REACTIONS”, filed Dec. 1, 2006 by Eid et al.; 20070141598, entitled “Nucleotide Compositions and Uses Thereof”, filed Nov. 3, 2006 by Turner et al.; 20070134128, entitled “Uniform surfaces for hybrid material substrate and methods for making and using same”, filed Nov. 27, 2006 by Korlach; 20070128133, entitled “Mitigation of photodamage in analytical reactions”, filed Dec. 2, 2005 by Eid et al.; 20070077564, entitled “Reactive surfaces, substrates and methods of producing same”, filed Sep. 30, 2005 by Roitman et al.; 20070072196, entitled “Fluorescent nucleotide analogs and uses therefore”, filed Sep. 29, 2005 by Xu et al.; and 20070036511, entitled “Methods and systems for monitoring multiple optical signals from a single source”, filed Aug. 11, 2005 by Lundquist et al.; and Korlach et al. (2008) “Selective aluminum passivation for targeted immobilization of single DNA polymerase molecules in zero-mode waveguide nanostructures” PNAS 105(4): 1176-81, all of which are herein incorporated by reference in their entireties.

8. Nucleic Acid Sequence Analysis

In some embodiments, a computer-based analysis program is used to translate the raw data generated by the detection assay (e.g., sequencing reads) into data of predictive value for an end user (e.g., medical personnel). The user can access the predictive data using any suitable means. Thus, in some preferred embodiments, the present technology provides the further benefit that the user, who is not likely to be trained in genetics or molecular biology, need not understand the raw data. The data is presented directly to the end user in its most useful form. The user is then able to use the information immediately (e.g., in medical diagnostics, research, or screening).

Some embodiments provide a system for reconstructing a nucleic acid sequence. The system can include a nucleic acid sequencer, a sample sequence data storage, a reference sequence data storage, and an analytics computing device/server/node. In some embodiments, the analytics computing device/server/node can be a workstation, mainframe computer, personal computer, mobile device, etc. The nucleic acid sequencer can be configured to analyze (e.g., interrogate) a nucleic acid fragment (e.g., single fragment, mate-pair fragment, paired-end fragment, etc.) utilizing all available varieties of techniques, platforms or technologies to obtain nucleic acid sequence information, in particular the methods as described herein using compositions provided herein. In some embodiments, the nucleic acid sequencer is in communication with the sample sequence data storage either directly via a data cable (e.g., serial cable, direct cable connection, etc.) or bus linkage or, alternatively, through a network connection (e.g., Internet, LAN, WAN, VPN, etc.). In some embodiments, the network connection can be a “hardwired” physical connection. For example, the nucleic acid sequencer can be communicatively connected (via Category 5 (CAT5), fiber optic, or equivalent cabling) to a data server that is communicatively connected (via CAT5, fiber optic, or equivalent cabling) through the Internet to the sample sequence data storage. In some embodiments, the network connection is a wireless network connection (e.g., Wi-Fi, WLAN, etc.), for example, utilizing an 802.11 a/b/g/n or equivalent transmission format. In practice, the network connection utilized is dependent upon the particular requirements of the system. In some embodiments, the sample sequence data storage is an integrated part of the nucleic acid sequencer.

In some embodiments, the sample sequence data storage is any database storage device, system, or implementation (e.g., data storage partition, etc.) that is configured to organize and store nucleic acid sequence read data generated by the nucleic acid sequencer such that the data can be searched and retrieved manually (e.g., by a database administrator or client operator) or automatically by way of a computer program, application, or software script. In some embodiments, the reference data storage can be any database device, storage system, or implementation (e.g., data storage partition, etc.) that is configured to organize and store reference sequences (e.g., whole or partial genome, whole or partial exome, SNP, gene, etc.) such that the data can be searched and retrieved manually (e.g., by a database administrator or client operator) or automatically by way of a computer program, application, and/or software script. In some embodiments, the sample nucleic acid sequencing read data can be stored on the sample sequence data storage and/or the reference data storage in a variety of different data file types/formats, including, but not limited to: *.txt, *.fasta, *.csfasta, *seq.txt, *qseq.txt, *.fastq, *.sff, *prb.txt, *.sms, *srs and/or *.qv.

In some embodiments, the sample sequence data storage and the reference data storage are independent standalone devices/systems or implemented on different devices. In some embodiments, the sample sequence data storage and the reference data storage are implemented on the same device/system. In some embodiments, the sample sequence data storage and/or the reference data storage can be implemented on the analytics computing device/server/node. The analytics computing device/server/node can be in communication with the sample sequence data storage and the reference data storage either directly via a data cable (e.g., serial cable, direct cable connection, etc.) or bus linkage or, alternatively, through a network connection (e.g., Internet, LAN, WAN, VPN, etc.). In some embodiments, the analytics computing device/server/node can host a reference mapping engine, a de novo mapping module, and/or a tertiary analysis engine. In some embodiments, the reference mapping engine can be configured to obtain sample nucleic acid sequence reads from the sample data storage and map them against one or more reference sequences obtained from the reference data storage to assemble the reads into a sequence that is similar but not necessarily identical to the reference sequence using all varieties of reference mapping/alignment techniques and methods. The reassembled sequence can then be further analyzed by one or more optional tertiary analysis engines to identify differences in the genetic makeup (genotype), gene expression or epigenetic status of individuals that can result in large differences in physical characteristics (phenotype). For example, in some embodiments, the tertiary analysis engine can be configured to identify various genomic variants (in the assembled sequence) due to mutations, recombination/crossover or genetic drift. Examples of types of genomic variants include, but are not limited to: single nucleotide polymorphisms (SNPs), copy number variations (CNVs), insertions/deletions (Indels), inversions, etc. The optional de novo mapping module can be configured to assemble sample nucleic acid sequence reads from the sample data storage into new and previously unknown sequences. It should be understood, however, that the various engines and modules hosted on the analytics computing device/server/node can be combined or collapsed into a single engine or module, depending on the requirements of the particular application or system architecture. Moreover, in some embodiments, the analytics computing device/server/node can host additional engines or modules as needed by the particular application or system architecture.
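By way of non-limiting illustration, the division of labor between a reference mapping engine and a tertiary analysis engine can be sketched as follows. This is a deliberately simplified sketch and not the mapping method of any particular embodiment: reads are placed by exhaustive mismatch counting, indels are ignored, and the function names, the `max_mismatches` parameter, and the example sequences are hypothetical.

```python
from collections import Counter


def best_position(read, reference, max_mismatches=2):
    """Return the reference offset with the fewest mismatches, or None if unmapped."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return None if best is None else best[0]


def call_variants(reads, reference, max_mismatches=2):
    """Pile up mapped reads and report positions whose majority base differs."""
    pileup = [Counter() for _ in reference]
    for read in reads:
        pos = best_position(read, reference, max_mismatches)
        if pos is None:
            continue  # read could not be placed within the mismatch constraint
        for offset, base in enumerate(read):
            pileup[pos + offset][base] += 1
    variants = []
    for i, counts in enumerate(pileup):
        if counts:
            majority_base = counts.most_common(1)[0][0]
            if majority_base != reference[i]:
                variants.append((i, reference[i], majority_base))
    return variants


reference = "ACGTACGTTAGCCGATAGCT"
# Hypothetical reads from a sample carrying a G-to-A substitution at index 10.
reads = ["ACGTTAAC", "TTAACCGA", "AACCGATA"]
print(call_variants(reads, reference))  # [(10, 'G', 'A')]
```

The mismatch constraint plays the role of an operating parameter of the kind a user might configure. Production aligners use indexed search rather than the exhaustive scan shown here.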

In some embodiments, the mapping and/or tertiary analysis engines are configured to process the nucleic acid and/or reference sequence reads in color space. In some embodiments, the mapping and/or tertiary analysis engines are configured to process the nucleic acid and/or reference sequence reads in base space. It should be understood, however, that the mapping and/or tertiary analysis engines disclosed herein can process or analyze nucleic acid sequence data in any schema or format as long as the schema or format can convey the base identity and position of the nucleic acid sequence.
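By way of non-limiting illustration, the relationship between color space and base space can be sketched under the assumption that color space refers to the two-base encoding used in *.csfasta files, in which each color (0 to 3) encodes the transition between adjacent bases. The source does not specify the encoding; the function names are hypothetical, and for simplicity the leading base is treated as part of the sequence.

```python
BASES = "ACGT"  # 2-bit codes: A=0, C=1, G=2, T=3


def encode_color_space(sequence):
    """Encode bases as a leading base followed by one color per adjacent base pair."""
    colors = [
        str(BASES.index(a) ^ BASES.index(b))  # color = XOR of the two base codes
        for a, b in zip(sequence, sequence[1:])
    ]
    return sequence[0] + "".join(colors)


def decode_color_space(read):
    """Decode a color-space read whose first character is a known base."""
    current = BASES.index(read[0])
    decoded = [read[0]]
    for color in read[1:]:
        current ^= int(color)  # applying the color to the previous base gives the next base
        decoded.append(BASES[current])
    return "".join(decoded)


print(encode_color_space("ATGGC"))  # A3103
print(decode_color_space("A3103"))  # ATGGC
```

Because every base after the first is recovered from the previous base and one color, either representation conveys base identity and position, which is the requirement stated above.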

In some embodiments, the sample nucleic acid sequencing read and reference sequence data can be supplied to the analytics computing device/server/node in a variety of different input data file types/formats, including, but not limited to: *.txt, *.fasta, *.csfasta, *seq.txt, *qseq.txt, *.fastq, *.sff, *prb.txt, *.sms, *srs and/or *.qv.
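By way of non-limiting illustration, one of the listed input formats, *.fastq, can be read and summarized by quality as sketched below. The sketch assumes four-line records and Phred+33 quality encoding, neither of which is stated in the source, and it counts individual base calls at or above the threshold, which is only one possible reading of a "Q30" metric. A Phred score Q corresponds to an error probability of 10^(-Q/10), so Q30 is one expected error in 1000 calls. The function names and the example records are hypothetical.

```python
import io


def parse_fastq(handle):
    """Yield (name, sequence, quality_string) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().strip()
        if not header:
            return  # end of input
        sequence = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().strip()
        yield header[1:], sequence, quality


def fraction_at_or_above(handle, threshold=30, offset=33):
    """Fraction of base calls whose Phred score is at least `threshold`."""
    total = passing = 0
    for _, _, quality in parse_fastq(handle):
        scores = [ord(ch) - offset for ch in quality]  # Phred+33 assumed
        total += len(scores)
        passing += sum(score >= threshold for score in scores)
    return passing / total if total else 0.0


# Two hypothetical records; 'I' is Q40, '?' is Q30, '5' is Q20, '#' is Q2.
example = io.StringIO("@read1\nACGT\n+\nIII5\n@read2\nGGCA\n+\n??I#\n")
print(fraction_at_or_above(example))  # 0.75
```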

Furthermore, a client terminal can be a thin client or thick client computing device. In some embodiments, the client terminal can have a web browser that can be used to control the operation of the reference mapping engine, the de novo mapping module and/or the tertiary analysis engine. That is, the client terminal can access the reference mapping engine, the de novo mapping module and/or the tertiary analysis engine using a browser to control their function. For example, the client terminal can be used to configure the operating parameters (e.g., mismatch constraint, quality value thresholds, etc.) of the various engines, depending on the requirements of the particular application. Similarly, the client terminal can also display the results of the analysis performed by the reference mapping engine, the de novo mapping module and/or the tertiary analysis engine.

The present technology also encompasses any method capable of receiving, processing, and transmitting the information to and from laboratories conducting the assays, information providers, medical personnel, and subjects.

9. Uses

The technology is not limited to particular uses, but finds use in a wide range of research (basic and applied), clinical, medical, and other biological, biochemical, and molecular biological applications. Some exemplary uses of the technology include genetics, genomics, and/or genotyping, e.g., of plants, animals, and other organisms, e.g., to identify haplotypes, phasing, and/or linkage of mutations and/or alleles. Particular and non-limiting illustrative examples in the human medical context include testing for cystic fibrosis and fragile X syndrome.

In addition, the technology finds use in the field of infectious disease, e.g., in identifying infectious agents such as viruses, bacteria, fungi, etc., and in determining viral types, families, species, and/or quasi-species, and to identify haplotypes, phasing, and/or linkage of mutations and/or alleles. A particular and non-limiting illustrative example in the area of infectious disease is characterization of human immunodeficiency virus (HIV) genetic elements and identifying haplotypes, phasing, and/or linkage of mutations and/or alleles. Other particular and non-limiting illustrative examples in the area of infectious disease include characterizing antibiotic resistance determinants; tracking infectious organisms for epidemiology; monitoring the emergence and evolution of resistance mechanisms; identifying species, sub-species, strains, extra-chromosomal elements, types, etc. associated with virulence; monitoring the progress of treatments, etc.

In some embodiments, the technology finds use in transplant medicine, e.g., for typing of the major histocompatibility complex (MHC), typing of the human leukocyte antigen (HLA), and for identifying haplotypes, phasing, and/or linkage of mutations and/or alleles associated with transplant medicine (e.g., to identify compatible donors for a particular host needing a transplant, to predict the chance of rejection, to monitor rejection, to archive transplant material, for medical informatics databases, etc.).

In some embodiments, the technology finds use in oncology and fields related to oncology. Particular and non-limiting illustrative examples in the area of oncology are identifying genetic and/or genomic aberrations related to cancer, predisposition to cancer, and/or treatment of cancer. For example, in some embodiments the technology finds use in detecting the presence of a chromosomal translocation associated with cancer; and in some embodiments the technology finds use in identifying novel gene fusion partners to provide cancer diagnostic tests. In some embodiments, the technology finds use in cancer screening, cancer diagnosis, cancer prognosis, measuring minimal residual disease, and selecting and/or monitoring a course of treatment for a cancer.

In some embodiments, the technology finds use in characterizing nucleotide sequences. For example, in some embodiments, the technology finds use in detecting insertions and/or deletions (“indels”) in a nucleotide (e.g., genome, gene, etc.) sequence. It is contemplated that the technology described herein provides improved indel detection relative to conventional technologies. In addition, the technology finds use in detecting short tandem repeats (STRs), inversions, large insertions, and in sequencing repetitive (e.g., highly repetitive) regions of a nucleotide sequence (e.g., of a genome).

Although the disclosure herein refers to certain illustrated embodiments, it is to be understood that these embodiments are presented by way of example and not by way of limitation.

EXAMPLES

Example 1—Comparison with Illumina MiSeq

During the development of the technology provided herein, calculations were performed to compare the performance of the technology provided herein (Tables 1 and 2, “SOD Library”) with conventional technology provided by Illumina in the MiSeq platform (Tables 1 and 2, “Illumina Amplicon Library”). Data were collected for two scenarios varying, e.g., the number of samples per run, criteria to measure throughput, etc. (see Tables 1 and 2).

As shown in Tables 1 and 2, the technology described herein decreases instrument run-time, has a higher throughput, and produces a higher percentage of reads with quality scores greater than Q30 with respect to NGS library construction using the Illumina technology.

TABLE 1 comparison with Illumina MiSeq (Targeted Sequencing: Amplicon Panel)
MiSeq (Sequencing reagent kit v2)^(a)

Parameter | SOD Library | Illumina Amplicon Library
# of samples per run | 8 | 8
# of amplicons per sample | 50 | 50
Average size of amplicons (bp) | 400 | 400
Required length of SBS read | 1 × 50 | 2 × 250^(b)
Total run time (hours)^(c) | 3 | 37
Avg. coverage for each amplicon per sample^(d) | 5357 | 37500
Throughput (# of samples with 1000x coverage/hour)^(e) | 14.3 | 8.1
Quality scores (percent of reads with score > Q30)^(f) | >90% | >75%

^(a)MiSeq Reagent kit v2: Dual-surface scanning, 12-15 million clusters passing filter
^(b)To cover the entire 400 bp amplicon, a 2 × 250 bp pair-end read strategy is implemented where the reads are overlapped by ~100 bp
^(c)Actual sequencing portion only (does not include cluster generation time)
^(d)To calculate coverage for SOD library: [(Total # of reads)/((insert size − SOD readlength) × (# of samples in a run × # of amplicons per sample))] × SOD readlength: e.g., [(15 × 10⁶)/((400 − 50) × (8 × 50))] × 50
^(e)To calculate throughput: [(mean coverage)/1000]/(total run time)
^(f)Based on MiSeq sequencing specification provided by Illumina, e.g., in their online materials.
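By way of non-limiting illustration, the calculations in footnotes (d) and (e) of Table 1 can be worked through as follows. The function and parameter names are hypothetical. Note one inference that is not stated in the source: footnote (e) as written yields about 1.8 for the SOD library, whereas the tabulated values of 14.3 and 8.1 are reproduced when the result is also multiplied by the number of samples per run, so the sketch exposes that factor as an optional `samples` argument.

```python
def sod_coverage(total_reads, insert_size, read_length, samples, amplicons_per_sample):
    """Footnote (d): mean coverage per amplicon per sample for an SOD library."""
    positions = (insert_size - read_length) * samples * amplicons_per_sample
    return total_reads / positions * read_length


def throughput(mean_coverage, target_coverage, run_time_hours, samples=1):
    """Footnote (e), with an optional per-run sample factor (an inferred assumption)."""
    return mean_coverage / target_coverage * samples / run_time_hours


coverage = sod_coverage(15e6, insert_size=400, read_length=50, samples=8, amplicons_per_sample=50)
print(round(coverage))                                       # 5357
print(round(throughput(coverage, 1000, 3), 1))               # 1.8, footnote (e) as written
print(round(throughput(coverage, 1000, 3, samples=8), 1))    # 14.3, matches the SOD column
print(round(throughput(37500, 1000, 37, samples=8), 1))      # 8.1, matches the Illumina column
```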

TABLE 2 comparison with Illumina MiSeq (Targeted Panel Sequencing of 400 bp insert)
MiSeq (seq kit v2)^(a) Targeted Panel Sequencing (400 bp insert)

Parameter | SOD Library | Illumina Amplicon Library | Illumina Amplicon Library
# of samples per run | 8 | 8 | 56
# of amplicons per sample | 50 | 50 | 50
Average size of amplicons (bp) | 400 | 400 | 400
Required length of SBS read | 1 × 50 | 2 × 250^(b) | 2 × 250^(b)
Total run time (hours)^(c) | 4 | 38 | 38
Mean coverage for each amplicon per sample^(d) | 5357 | 37500 | 5357
Throughput (# of samples with 2000x coverage/hour)^(e) | 7.1 | 4.1 | 4.1
Quality scores (percent of reads with score > Q30)^(f) | >90% | >75% | >75%

^(a)MiSeq Reagent kit v2: Dual-surface scanning, 15 million clusters passing filter
^(b)To cover the entire 200 or 400 bp amplicon, a 2 × 150 or 2 × 250 bp (respectively) pair-end read strategy is implemented where the reads are overlapped by ~100 bp
^(c)Actual sequencing portion only (does not include cluster generation time)
^(d)To calculate coverage for SOD library: [(Total # of reads)/((insert size − SOD readlength) × (# of samples in a run × # of amplicons per sample))] × SOD readlength: e.g., [(15 × 10⁶)/((400 − 50) × (8 × 50))] × 50
^(e)To calculate throughput: [(mean coverage)/2000]/(total run time)
^(f)Based on MiSeq sequencing specification provided by Illumina, e.g., in their online materials.

Example 2—Comparison with Ion Torrent PGM (Targeted Sequencing: AmpliconPanel)

During the development of the technology provided herein, calculations were performed to compare the performance of the technology provided herein (Tables 3 and 4, “SOD Library”) with conventional technology provided by Ion Torrent in the PGM platform (Tables 3 and 4, “Ion Amplicon Library”). Data were collected for two scenarios varying, e.g., the number of samples per run, criteria to measure throughput, etc. (see Tables 3 and 4).

As shown in Tables 3 and 4, the technology described herein decreases instrument run-time and produces a higher percentage of reads with quality scores greater than Q20 with respect to NGS library construction using the Ion Torrent technology.

TABLE 3 comparison with Ion Torrent PGM
Ion PGM (400 bp Sequencing reagent kit v2)^(a)

Parameter | SOD Library | Ion Amplicon Library
# of samples per run | 1 | 1
# of amplicons per sample | 50 | 50
Average size of amplicons (bp) | 400 | 400
Required length of SBS read | 1 × 50 | 1 × 400 (bi-directional)^(b)
Total run time (hours)^(c) | 0.5 | 4
Avg. coverage for each amplicon per sample^(d) | 1143 | 8000
Throughput (# of samples with 1000x coverage/hour)^(e) | 2.3 | 2.0
Quality scores (percent of reads with score > Q20)^(f) | >90% | >50%

^(a)PGM 400 bp Sequencing Reagent kit v2
^(b)To cover the entire 400 bp amplicon, a 1 × 400 bp bi-directional sequencing is performed
^(c)Actual sequencing portion only (does not include OneTouch2 and other pre-sequencing process time)
^(d)To calculate coverage for SOD library: [(0.4 × 10⁶)/((400 − 50) × (8 × 50))] × 50
^(e)To calculate throughput: [(avg. coverage)/1000]/(total run time)
^(f)Projected based on: Loman N. et al. (2012) “Performance comparison of benchtop high-throughput sequencing platforms” Nature Biotechnology, vol. 30-5.

TABLE 4 comparison with Ion Torrent PGM
Ion PGM (318 v2/seq kit)^(a) Targeted Panel Sequencing (400 bp insert)

Parameter | SOD Library | Ion Amplicon Library | Ion Amplicon Library
# of samples per run | 4 | 4 | 28
# of amplicons per sample | 50 | 50 | 50
Average size of amplicons (bp) | 400 | 400 | 400
Required length of SBS read | 1 × 50 | 1 × 400 (bi-directional)^(b) | 1 × 400 (bi-directional)^(b)
Total run time (hours)^(c) | 0.5 | 7.25 | 7.25
Mean coverage for each amplicon per sample^(d) | 4286 | 30000 | 4286
Throughput (# of samples with 2000x coverage/hour)^(e) | 17.1 | 8.3 | 8.3
Quality scores (percent of reads with score > Q20)^(f) | >90% | >50% | >50%

^(a)Ion PGM chip 318/v2: ~6 million load wells producing reads passing filter
^(b)To cover the entire 200-bp or 400-bp amplicon, a 200-bp (bi-directional) or 400-bp (bi-directional) strategy is implemented, respectively
^(c)Actual sequencing portion only (does not include ePCR/enrichment)
^(d)To calculate coverage for SOD library: [(# of total reads)/((insert size − SOD readlength) × (# of samples × # of amplicon))] × SOD readlength, e.g., [(15 × 10⁶)/((400 − 50) × (8 × 50))] × 50
^(e)To calculate throughput: [(mean coverage)/2000]/(total run-time)
^(f)Based on Ion Torrent sequencing specification available in the Ion Torrent online materials

Example 3—Comparison Technologies for Long Reads

Tables 5 and 6 compare the performance of the technology provided herein with conventional technologies for sequencing long amplicons of approximately 1000 bp (Table 5) and 2000 bp (Table 6). Run-time does not increase with amplicon size for the present technology because the read size is ~30-50 bases regardless of the size of the target nucleic acid to be sequenced. In some embodiments, a 2000-bp sequence is produced by the technology provided herein in a time that is an order of magnitude less than the conventional technology (see, e.g., Table 6). In some embodiments, the technology provided herein provides a longer sequence read with the same run time as the conventional technology.

TABLE 5 comparison for long-amplicon sequencing 1000 bp

Parameter | SOD Library^(a) | Illumina TruSeq gDNA Library | Ion Library
# of samples per run | 8 | 8 | 1
# of amplicons per sample | 50 | 50 | 50
Average size of amplicons (bp) | 1000 | 1000 | 1000
Required length of SBS read | 1 × 50 | 2 × 250 (pair-end) | 1 × 400 (bi-directional)
Total run time (hours) | 3 | 37 | 4
Avg. coverage for each amplicon per sample | 1974 | — | —
Throughput (# of samples with 1000x coverage/hour) | 5.3 | — | —
Quality scores (percent of reads with score > Q30) | >90% | — | —

^(a)SOD library run on a MiSeq with sequencing reagent kit v2

TABLE 6 comparison for long-amplicon sequencing 2000 bp
Long Read Application (2 Kb insert size)

Parameter | SOD Library^(a) | Illumina Lib | Ion PGM Lib
# of samples per run | 8 | 8 | 4
# of amplicons per sample | 50 | 50 | 50
Average size of amplicons (bp) | 2000 | 2000 | 2000
Required length of SBS read | 1 × 50 | 2 × 250 | 1 × 400
Total run time (hours) | 3 on MiSeq; 0.5 on PGM | 37^(b) | 7.25
Mean coverage for each amplicon per sample | 962 | — | —
Throughput (# of samples with 2000x coverage/hour) | 2.6 | — | —
Quality scores (percent of reads with score > Q30) | >90% | — | —
Cost per run (seq reagent and chip only, $) | 725 | — | —
Cost per sample ($) | 90.63 | — | —

^(a)SOD library prep time for longer insert size is longer in some embodiments (e.g., from ~6.5 hours to ~8.5 hours)
^(b)Illumina “Moleculo” technology

Example 4—Concept Verification of Data Obtained Using a Model Library

During the development of embodiments of the technology provided herein, data were collected to verify the technology using a model library. As shown in FIG. 4, a consensus sequence of ~127 bp is constructed from a collection of ~35-bp reads produced according to embodiments of the technology provided. The calculated sequencing run time on an Illumina MiSeq DNA sequencing apparatus to produce the ~127-bp sequence using a library produced by the technology provided herein is approximately 2.5 hours. Using the conventional technology to provide the library, a run time of ~13 hours produces the same ~127-bp sequence read.

Example 5—Ladder Generation Using 3′-O-Propargyl dNTP Termination

During the development of embodiments of the technology provided herein, experiments were conducted to assess the generation of terminated nucleic acid fragments in a reaction comprising a mixture of 3′-O-propargyl-dNTPs and natural (standard) dNTPs. In particular, experiments were conducted to assess the generation of fragments terminated at each position within the target region by incorporation of chain-terminating 3′-O-propargyl-dNTPs by DNA polymerase during synthesis. Polymerase extension assays were conducted using a template nucleic acid having a sequence from human KRAS (e.g., KRAS exon 2 and flanking intron sequences) and a complementary primer:

KRAS Exon 2 Template (SEQ ID NO: 1)
TTATTATAAGGCCTGCTGAAAATGACTGAATATAAACTTGTGGTAGTTGGAGCTGGTGGCGTAGGCAAGAGTGCCTTGACGATACAGCTAATTCAGAATCATTTTGTGGACGAATATGATCCAACAATAGAGGTAAATCTTGTTTTAATATGCATATTACTGGTGCAGGACCATTCT

R_ke2_trP1_T_bio (SEQ ID NO: 2)
bTAAUCCTCTCTATGGGCAGTCGGTGATAGAATGGTCCTGCACCAGTAA

In the R_ke2_trP1_T_bio primer sequence (SEQ ID NO: 2), a “b” indicates a biotin modification and a “U” indicates a deoxyuridine modification. Incorporation of the primers into extension products produces extension products comprising a uracil. The uracil is useful, e.g., for cleavage of the product (e.g., using uracil cleavage reagents) in a number of molecular biological manipulations (e.g., cleaving the product from a solid support).
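The relationship between the primer and the template can be checked computationally. The following Python sketch is illustrative only; the helper names are hypothetical, the biotin label is omitted, and the deoxyuridine is treated as T for base pairing. It finds the longest 3′ portion of the primer (SEQ ID NO: 2) that is the reverse complement of the 3′ end of the template (SEQ ID NO: 1).

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an unmodified DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def annealing_length(template, primer):
    """Longest primer 3' region that base-pairs with the template 3' end."""
    rc = reverse_complement(primer)
    for k in range(min(len(template), len(rc)), 0, -1):
        if template.endswith(rc[:k]):
            return k
    return 0


# SEQ ID NO: 1
KRAS_TEMPLATE = "TTATTATAAGGCCTGCTGAAAATGACTGAATATAAACTTGTGGTAGTTGGAGCTGGTGGCGTAGGCAAGAGTGCCTTGACGATACAGCTAATTCAGAATCATTTTGTGGACGAATATGATCCAACAATAGAGGTAAATCTTGTTTTAATATGCATATTACTGGTGCAGGACCATTCT"
# SEQ ID NO: 2 without the 5' biotin ("b"); dU is written as T for pairing
PRIMER = "TAAUCCTCTCTATGGGCAGTCGGTGATAGAATGGTCCTGCACCAGTAA".replace("U", "T")

n = annealing_length(KRAS_TEMPLATE, PRIMER)
print(len(KRAS_TEMPLATE), n, PRIMER[-n:])
```

In this sketch the 21 bases at the 3′ end of the primer pair with the last 21 bases of the 177-base template, and the remaining 5′ portion of the primer forms a non-templated tail.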

Experiments were conducted using a mixture of natural dNTPs and all four of the 3′-O-propargyl-dNTPs in a single reaction. The DNA fragment generation reaction mix comprised 20 mM Tris-HCl, 10 mM (NH₄)₂SO₄, 10 mM KCl, 2 mM MnCl₂, 0.1% Triton X-100, 1000 pmol dATP, 1000 pmol dCTP, 1000 pmol dGTP, 1000 pmol dTTP, 100 pmol of 3′-O-propargyl-dATP, 100 pmol of 3′-O-propargyl-dCTP, 100 pmol of 3′-O-propargyl-dGTP, 100 pmol of 3′-O-propargyl-dTTP, 6.25 pmol of primer R_ke2_trP1_T_bio (SEQ ID NO: 2), and 2 units of THERMINATOR II DNA polymerase (New England BioLabs) in a 25-μl reaction volume. 0.5 pmol of purified amplicon corresponding to a region in KRAS exon 2 (SEQ ID NO: 1) was used as template. The polymerase extension reaction was thermocycled by heating to 95° C. for 2 minutes, followed by 45 cycles at 95° C. for 15 seconds, 55° C. for 25 seconds, and 65° C. for 35 seconds.
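The amounts and cycling conditions above translate into concentrations and a programmed hold time as follows. This Python sketch is illustrative only; the function names are hypothetical, and the hold time excludes instrument ramping between temperatures, which the text does not specify.

```python
def micromolar(pmol, volume_ul):
    """Concentration in micromolar; 1 pmol per microliter equals 1 micromolar."""
    return pmol / volume_ul


def cycling_hold_seconds(initial_s, cycles, step_seconds):
    """Sum of programmed hold times, ignoring ramp time between steps."""
    return initial_s + cycles * sum(step_seconds)


dntp_um = micromolar(1000, 25)        # each natural dNTP
terminator_um = micromolar(100, 25)   # each 3'-O-propargyl-dNTP
ratio = dntp_um / terminator_um       # natural dNTP to terminator

hold_s = cycling_hold_seconds(120, 45, (15, 25, 35))

print(dntp_um, terminator_um, ratio)
print(hold_s, hold_s / 60)
```

Each natural dNTP is therefore present at 40 μM and each terminator at 4 μM (a 10:1 ratio), and the programmed holds total 3495 seconds (about 58 minutes).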

After the polymerase extension reaction, 1 μl of the reaction mix was used directly for DNA fragment size analysis using gel electrophoresis (Agilent 2100 Bioanalyzer and High Sensitivity DNA Assay Chip). Fragment size analysis of the reaction products indicated that the fragment generation reaction successfully produced a ladder of nucleic acid fragments having the expected sizes.

Example 6—Synthesis of 5′-Azido-Methyl-Modified Oligonucleotide

During the development of embodiments of the technology provided herein, an oligonucleotide comprising a 5′-azido-methyl modification was synthesized and characterized. Synthesis of the modified oligonucleotide was performed using phosphoramidite chemical synthesis. In the last synthetic step, phosphoramidite chemical synthesis was used to incorporate a 5′-iodo-dT phosphoramidite at the terminal 5′ position. The oligonucleotide attached to the solid support in the reaction column was then treated as follows.

First, sodium azide (30 mg) was resuspended in dry DMF (1 ml), heated for 3 hours at 55° C., and cooled to room temperature. The supernatant was taken up with a 1-ml syringe and passed back and forth through the reaction column comprising the 5′-iodo-modified oligonucleotide and incubated overnight at ambient (room) temperature. After incubation, the column was washed with dry DMF, washed with acetonitrile, and then dried via argon gas. The resulting 5′-azido-methyl-modified oligonucleotide was cleaved from the solid support and deprotected by heating in aqueous ammonia for 5 hours at 55° C. The final product was an oligonucleotide having the sequence shown below:

(SEQ ID NO: 3) Az-TCTGAGTCGGAGACACGCAGGGATGAGATGGT

The “Az” indicates the azido-methyl modification at the 5′ end (e.g., 5′-azido-methyl modification), e.g., to provide an oligonucleotide having a structure according to

where B is the base of the nucleotide (e.g., adenine, guanine, thymine, cytosine, or a natural or synthetic nucleobase, e.g., a modified purine such as hypoxanthine, xanthine, 7-methylguanine; a modified pyrimidine such as 5,6-dihydrouracil, 5-methylcytosine, 5-hydroxymethylcytosine; etc.).

Example 7—Conjugation of 5′-Azido-Methyl-Modified Oligonucleotide and 3′-O-Propargyl-Modified Nucleic Acid Fragments

During the development of embodiments of the technology provided herein, experiments were conducted to test the conjugation of a 5′-azido-methyl-modified oligonucleotide (e.g., see Example 6) to 3′-O-propargyl-modified nucleic acid fragments (e.g., see Example 5) by click chemistry. In particular, experiments were conducted in which a 5′-azido-methyl-modified oligonucleotide was chemically conjugated to 3′-O-propargyl-modified DNA fragments using copper (I) catalyzed 1,3-dipolar alkyne-azide cycloaddition chemistry (“click chemistry”).

Click chemistry was performed using commercially available reagents (baseclick GmbH, Oligo-Click-M Reload kit) according to the manufacturer's instructions. Briefly, approximately 0.1 pmol of 3′-O-propargyl-modified DNA fragments comprising a 5′-biotin modification were reacted with approximately 500 pmol of 5′-azido-methyl-modified oligonucleotide using the click chemistry reagent in a total volume of 10 μl. The reaction mixture was incubated at 45° C. for 30 minutes. Following the incubation, the supernatant was transferred to a new microcentrifuge tube and a 40-μl volume of the commercially supplied binding and wash buffer (e.g., 1 M NaCl, 10 mM Tris-HCl, 1 mM EDTA, pH 7.5) was added. The conjugated reaction product was isolated from the excess 5′-azido-methyl-modified oligonucleotide by incubating the click chemistry reaction mixture with streptavidin-coated magnetic beads (Dynabeads, MyOne Streptavidin C1, Life Technologies) at ambient (room) temperature for 15 minutes. The beads were separated from the supernatant using a magnet and the supernatant was removed. Subsequently, the beads were washed twice using the binding and wash buffer and then resuspended in 25 μl of TE buffer (10 mM Tris-HCl, 0.1 mM EDTA, pH approximately 8).

The product was cleaved from the solid support (bead) using uracil cleavage (Uracil Glycosylase and Endonuclease VIII, Enzymatics). In particular, uracil cleavage reagents were used to cleave the reaction products at the site of the deoxyuridine modification located near the 5′-terminal location of the conjugated product (see SEQ ID NOs: 2-5). Finally, the supernatant comprising the conjugated product was purified using Ampure XP (Beckman Coulter) following the manufacturer's protocol and eluted in 20 μl of TE buffer.

Example 8—Amplification of Conjugated Product

During the development of embodiments of the technology described herein, experiments were performed to characterize the chemical conjugation of the 5′-azido-methyl-modified oligonucleotide to the 3′-O-propargyl modified nucleic acid fragments and to evaluate the triazole linkage as a mimic of a natural phosphodiester bond in a nucleic acid backbone. To test the ability of a polymerase to recognize the conjugated product as a template and traverse the triazole linkage during synthesis, PCR primers were designed to produce amplicons that span the triazole linkage of the conjugation products:

Primer 1 (SEQ ID NO: 4) CCTCTCTATGGGCAGTCGGTGAT
Primer 2 (SEQ ID NO: 5) CCATCTCATCCCTGCGTGTCTC

A commercially available PCR pre-mix (KAPA 2G HS, KAPA Biosystems) was used to provide a 25-μl reaction mixture comprising, in addition to components provided by the mix (e.g., buffer, polymerase, dNTPs), 0.25 μM Primer 1 (SEQ ID NO: 4), 0.25 μM of Primer 2 (SEQ ID NO: 5), and 2 μl of purified conjugated product (see Example 7) as template for amplification. The reaction mixture was thermally cycled by incubating the sample at 95° C. for 5 minutes, followed by 30 cycles of 98° C. for 20 seconds, 60° C. for 30 seconds, and 72° C. for 20 seconds. The amplification products were analyzed by gel electrophoresis (e.g., using an Agilent Bioanalyzer 2100 system and High-Sensitivity DNA Chip) to determine the size distributions of the reaction products.

Analysis of the amplification products indicated that the amplification reaction successfully produced amplicons using the conjugated products of the click chemistry reaction (see Example 7) as templates for amplification. In particular, analysis of the amplification products indicated that the polymerase processed along the template and through the triazole linkage to produce amplicons from the template. Further, the amplification produced a heterogeneous population of amplicons having a range of sizes corresponding to the expected sizes produced by amplification of the base-specific terminated DNA fragments via incorporation of the 3′-O-propargyl-dNTP. The fragment analysis also showed the proper fragment size increase corresponding to thirty-one (31) additional bases from the conjugated 5′-azido-methyl-modified oligonucleotide.
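The 31-base size increase is consistent with where the two PCR primers sit on the conjugated product. The following Python sketch is illustrative only; the variable names are hypothetical and the modifications (“b”, “U”, “Az”) are left out or kept as plain characters. It checks that Primer 1 (SEQ ID NO: 4) is contained in the extension primer (SEQ ID NO: 2) and that Primer 2 (SEQ ID NO: 5) is the reverse complement of part of the conjugated oligonucleotide (SEQ ID NO: 3), and then counts how many bases of that oligonucleotide fall within the amplicon.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an unmodified DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


EXTENSION_PRIMER = "TAAUCCTCTCTATGGGCAGTCGGTGATAGAATGGTCCTGCACCAGTAA"  # SEQ ID NO: 2, no biotin
AZIDO_OLIGO = "TCTGAGTCGGAGACACGCAGGGATGAGATGGT"                       # SEQ ID NO: 3, no "Az"
PRIMER_1 = "CCTCTCTATGGGCAGTCGGTGAT"                                   # SEQ ID NO: 4
PRIMER_2 = "CCATCTCATCCCTGCGTGTCTC"                                    # SEQ ID NO: 5

# Primer 1 matches the extension primer just 3' of the deoxyuridine.
primer1_start = EXTENSION_PRIMER.find(PRIMER_1)

# Primer 2 anneals to the oligonucleotide; its offset in the reverse
# complement tells how many 3'-terminal oligo bases lie outside the amplicon.
offset = reverse_complement(AZIDO_OLIGO).find(PRIMER_2)
oligo_bases_in_amplicon = len(AZIDO_OLIGO) - offset

print(primer1_start, offset, oligo_bases_in_amplicon)
```

Under this reading, Primer 2 covers all but the 3′-terminal base of the 32-base oligonucleotide, so 31 of its bases are carried into the amplicon.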

Example 9—Ligation of NGS Adaptors to Fragment Ladder Products

During the development of embodiments of the technology provided herein, experiments were conducted to sequence ladder fragments produced according to the technology provided herein (see FIG. 5). As an initial step in sequencing, experiments were conducted to prepare a sequencing library using DNA ladder products generated in Example 8 as input and a commercial kit for sample preparation. Sequencing libraries were prepared using a TRUSEQ NANO DNA sample preparation kit (Illumina, Inc.) following the manufacturer protocol with the following modification. After the adaptor ligation step, two rounds (instead of one round) of bead-based purification were performed using a 1:1 (v/v) sample to bead-mix ratio. Eight amplification cycles were performed using the provided Illumina PCR primers to enrich the adaptor-ligated products following the manufacturer protocol. The final sequencing library was analyzed by gel electrophoresis (Agilent 2100 Bioanalyzer and High Sensitivity DNA Assay Chip). Fragment size analysis confirmed the successful generation of an NGS library (e.g., for Illumina sequencing) using the fragment ladder products of Example 8. The data indicated that the NGS library had the proper fragment size increase corresponding to the addition of the 126-bp Illumina adaptors and thus that the adaptors were properly ligated to the fragment ladder. FIG. 5 shows a schematic of fragments of the sequencing library. In particular, the fragments comprise an Illumina adaptor on both ends, one or more universal sequences, and a target sequence.

Example 10—Sequencing

During the development of embodiments of the technology provided herein, experiments were conducted to sequence an adaptor-ligated NGS library, e.g., a sequencing library prepared as described in Example 9. The library produced according to Example 9 was successfully sequenced on an Illumina MiSeq sequencer using a 2×75-bp sequencing-by-synthesis kit. Sequencing primers complementary to the adaptor sequences are provided by the kit. After sequencing, more than 89% of the reads had a sequence quality score of Q30 or better.

Data collected from the experiments indicated that the fragment population provides for the unambiguous alignment of the short sequencing reads (30-50 bp) produced by the technology. In particular, the overlapping nucleic acid fragments provided reads that were successfully aligned and assembled despite their small size.

Sequence data were extracted from the sequencer output using a custom data processing work-flow that accommodates the particular design of the fragment ladder produced according to the technology. For example, the custom software identified reads and processed reads to use 40-bp portions of the 2×75-bp sequence reads for subsequent sequence alignment. Particular components of the custom software concatenate reads (e.g., Read1 and Read2 FASTQ files) produced from the NGS sequencer; identify sequence originating from the target sequence, universal sequence, and adaptors (e.g., identify sequence originating from the 5′-azido-methyl-oligonucleotide); set a sequence extraction boundary using pattern recognition; extract the target sequence from the sequence reads produced from the NGS sequencer; and align the sequences (see FIG. 5).
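The extraction step can be pictured with a minimal sketch. The custom software itself is not described in detail, so the following Python code is a hypothetical stand-in and not the work-flow that was used. It assumes that a read begins with the 31 bases derived from the 5′-azido-methyl oligonucleotide (SEQ ID NO: 3) in reverse-complement orientation, uses an exact string match as the “pattern recognition”, and returns the 40 bases that follow; the function name and the example read are invented for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an unmodified DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


AZIDO_OLIGO = "TCTGAGTCGGAGACACGCAGGGATGAGATGGT"  # SEQ ID NO: 3, no "Az"
# Assumed anchor: the 31 oligo-derived bases as they would appear in a read
ANCHOR = reverse_complement(AZIDO_OLIGO[:31])


def extract_target(read, anchor=ANCHOR, length=40):
    """Return `length` bases after the anchor, or None if not recoverable."""
    pos = read.find(anchor)
    if pos < 0:
        return None                      # anchor not found in this read
    start = pos + len(anchor)            # extraction boundary
    target = read[start:start + length]
    return target if len(target) == length else None


# Hypothetical 75-base read: 31 anchor bases followed by 44 target bases
example_read = ANCHOR + "ACGT" * 11
print(len(example_read), extract_target(example_read))
```

With a 75-base read and a 31-base anchor, 44 bases remain, which is enough for the 40-base extraction described above. A production work-flow would typically also tolerate sequencing errors in the anchor and handle both read orientations.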

Example 11—Sequence Alignment

During the development of embodiments of the technology provided herein, experiments were conducted to align sequence data produced from an NGS library as described herein, produce a consensus sequence from the alignment, and align the consensus sequence to a reference sequence. In particular, 40-bp sequence reads that were extracted from the MiSeq sequencing output were aligned against a reference sequence (e.g., a 177-bp sequence comprising human KRAS gene exon 2 and partial flanking intron sequences).

Alignment of the 40-bp sequencing reads was performed using CLC Genomics Workbench v7 with stringent penalties for mismatches and indels; length and similarity match requirements were appropriately set according to the accompanying instructions for 40-bp reads. The alignment results (FIG. 6A) indicated that 40-bp sequence reads provided complete coverage of the entire reference sequence (177 bp). Further, the plot of coverage depth versus sequence position had the expected “trapezoidal” coverage profile that was elucidated during theoretical alignment simulation (FIG. 6B).
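The trapezoidal shape follows from tiling short reads across a longer reference. The following Python sketch is an idealized illustration and not the simulation behind FIG. 6B: it assumes exactly one 40-base read starting at every possible position of a 177-base reference, whereas real read counts depend on the fragment ladder distribution. The function name is hypothetical.

```python
def tiled_coverage(reference_length, read_length):
    """Depth at each position when one read starts at every valid offset."""
    depth = [0] * reference_length
    for start in range(reference_length - read_length + 1):
        for pos in range(start, start + read_length):
            depth[pos] += 1
    return depth


depth = tiled_coverage(177, 40)

# Depth ramps up over the first 40 positions, plateaus, then ramps down.
print(depth[0], depth[39], depth[88], depth[137], depth[138], depth[176])
```

In this idealized case the depth rises from 1 to 40 over the first 40 positions, stays at 40 through the middle of the reference, and falls back to 1 over the last 40 positions.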

These results indicate that a relatively short sequencing run (e.g., MiSeq with 30 to 50 sequencing-by-synthesis cycles) produces a complete, high-quality sequence of the target. Further, with adjustments to existing methods, e.g., designing primers to bind immediately adjacent to the target site, the length of high-quality sequence can be maximized. Further, the length of high-quality sequence can also be maximized with appropriate generation of the fragment ladder to cover the entire length of the target (e.g., by adjusting the ratio of 3′-O-propargyl-dNTPs to dNTPs; see Example 12). In this example, 40 sequencing cycles (to obtain 40 bases of sequence) on the MiSeq took approximately 2.5 hours. Importantly, though, the technology provides an improvement over existing technologies in that the sequencer run-time does not change depending on the target size.

Example 12—Sequencing and Analysis of NGS Libraries

During the development of embodiments of the technology provided herein, experiments were conducted to control the size distribution of terminated nucleic acid fragments produced in a reaction comprising a mixture of 3′-O-propargyl-dNTPs and natural (standard) dNTPs by adjusting the ratio of 3′-O-propargyl-dNTPs to natural (standard) dNTPs. It was contemplated that the molar ratio of 3′-O-propargyl-dNTPs and natural dNTPs affects the fragment size distribution due to competition between the 3′-O-propargyl-dNTPs (that terminate extension) and natural dNTPs (that elongate the polymerase product) for incorporation into the synthesized nucleic acid by the polymerase. Accordingly, experiments were performed in which the products of fragment ladder generation reactions were assessed at various molar ratios of 3′-O-propargyl-dNTPs to natural dNTPs. Fragment ladder generation reactions were performed using 2:1, 10:1, and 100:1 molar ratios of natural dNTPs to 3′-O-propargyl-dNTPs. The fragment generation reaction mixtures used in these experiments comprised 20 mM Tris-HCl, 10 mM (NH₄)₂SO₄, 10 mM KCl, 2 mM MnCl₂, 0.1% Triton X-100, 1000 pmol dATP, 1000 pmol dCTP, 1000 pmol dGTP, 1000 pmol dTTP, 6.25 pmol of primer, 2 units of Therminator II DNA polymerase (New England BioLabs), and 0.5 pmol of purified amplicon corresponding to a region in KRAS exon 2 (SEQ ID NO: 1) as template in a 25-μl final reaction volume.

In addition, reactions testing a 2:1 ratio of natural dNTPs to 3′-O-propargyl-dNTPs comprised 500 pmol of 3′-O-propargyl-dATP, 500 pmol of 3′-O-propargyl-dCTP, 500 pmol of 3′-O-propargyl-dGTP, and 500 pmol of 3′-O-propargyl-dTTP. Reactions testing a 10:1 ratio of natural dNTPs to 3′-O-propargyl-dNTPs comprised 100 pmol of 3′-O-propargyl-dATP, 100 pmol of 3′-O-propargyl-dCTP, 100 pmol of 3′-O-propargyl-dGTP, and 100 pmol of 3′-O-propargyl-dTTP. Reactions testing a 100:1 ratio of natural dNTPs to 3′-O-propargyl-dNTPs comprised 10 pmol of 3′-O-propargyl-dATP, 10 pmol of 3′-O-propargyl-dCTP, 10 pmol of 3′-O-propargyl-dGTP, and 10 pmol of 3′-O-propargyl-dTTP.

The polymerase extension reactions were temperature cycled by incubating at 95° C. for 2 minutes, followed by 45 cycles at 95° C. for 15 seconds, 55° C. for 25 seconds, and 65° C. for 35 seconds. After the polymerase extension reaction, 5′-azido-methyl-modified oligonucleotides were chemically conjugated to the nucleic acid fragments terminated with 3′-O-propargyl-dN using click chemistry as described in Example 6 and Example 7. After the conjugation, the conjugation products were used as templates for amplification to produce amplicons corresponding to the conjugated products as described in Example 8. Fragment size analysis was performed on the conjugated products.

Fragment size analysis of the amplified conjugation products produced from the products of the three different molar ratio conditions indicated that the fragment size depended on the ratio of 3′-O-propargyl-dNTPs to natural dNTPs. Analysis of the fragment sizes shows a fragment size distribution shift as a function of the molar ratios of dNTP to 3′-O-propargyl-dNTP. At the 2:1 molar ratio, larger populations of shorter fragments were detected compared to the other two molar ratio conditions. At the 10:1 molar ratio, a larger fraction of longer fragments was present relative to the 2:1 molar ratio. At the 100:1 molar ratio, the major population of fragments comprised longer DNA fragments relative to the other two molar ratios.

The ladder fragments produced with the three different molar ratios were used as separate inputs to generate NGS (Illumina) libraries for sequencing on the MiSeq sequencer as described in Example 9. Furthermore, sequence reads were obtained as described in Example 10 and sequence data from the target sequence were extracted and analyzed as described in Example 11.

The coverage profiles of the three libraries that were prepared using the three different molar ratios of dNTP to 3′-O-propargyl-dNTP (molar ratios of 2:1, 10:1, and 100:1) correlated with the DNA ladder fragment size distribution created by the respective molar ratios. For example, the 2:1 molar ratio of dNTP to 3′-O-propargyl-dNTP was expected to terminate polymerase extension at a high frequency due to the relatively high abundance of 3′-O-propargyl-dNTP and thus produce nucleic acid ladder fragments that are relatively shorter than at higher ratios of dNTP to 3′-O-propargyl-dNTP. In contrast, the 100:1 molar ratio was expected to terminate polymerase extension at a low frequency due to the relatively low abundance of 3′-O-propargyl-dNTP and thus produce nucleic acid ladder fragments that are relatively longer than at lower ratios of dNTP to 3′-O-propargyl-dNTP.
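The expected trend can be illustrated with a simple competition model. The following Python sketch is a simplification and not a model taken from the experiments: it assumes that the polymerase incorporates a terminator or a natural dNTP strictly in proportion to their molar amounts (equal incorporation efficiency, which is not established here), so that extension length follows a geometric distribution. The function names are hypothetical.

```python
def termination_probability(ratio):
    """Chance that any one incorporation is a terminator, for a
    natural dNTP : 3'-O-propargyl-dNTP molar ratio of `ratio`:1."""
    return 1.0 / (ratio + 1.0)


def mean_extension_length(ratio):
    """Mean bases added, counting the terminator (geometric mean, 1/p)."""
    return ratio + 1.0


def fraction_longer_than(ratio, n_bases):
    """Fraction of products that add more than `n_bases` before terminating."""
    return (1.0 - termination_probability(ratio)) ** n_bases


for ratio in (2, 10, 100):
    print(ratio, mean_extension_length(ratio), fraction_longer_than(ratio, 100))
```

Under these assumptions the mean extension is 3, 11, and 101 bases for the 2:1, 10:1, and 100:1 ratios, and only the 100:1 condition leaves a substantial fraction of products extending beyond 100 bases, which matches the qualitative shift toward longer fragments at higher ratios.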

The data collected from the fragment size analysis of the DNA ladder products generated using the three different molar ratios confirmed these predictions. In particular, the data indicate that varying the molar ratio of dNTP to 3′-O-propargyl-dNTP provides for the control of DNA ladder fragment size.

Furthermore, sequencing of the DNA ladder products generated using the three different molar ratios and analysis of the sequence produced from the ladder products showed that the sequence coverage profiles correlated with the molar ratio of dNTP to 3′-O-propargyl-dNTP used during DNA ladder generation. In particular, the data indicated that the 2:1 molar ratio provided more coverage of sequence near the binding site of the sequencing primer and the 100:1 molar ratio provided more coverage further from the binding site of the sequencing primer. Accordingly, the technology provides the ability to control DNA ladder fragment generation for a variety of sequencing applications. In particular, increasing coverage distant from the sequencing primer binding site is useful for sequencing applications involving long (e.g., greater than 100 base pairs) sequences. Sequencing using multiple sequencing libraries produced at different molar ratios provides sequence data having high coverage of sequences that are near, intermediate, and far from the binding site of the sequencing primer.

Example 13—Tagging with Primers Comprising an Index Sequence

During the development of embodiments of the technology provided herein, experiments were conducted to assess the use of index or barcode sequences to track and construct the sequence of the original target template from the sequence produced from library generation, NGS, and alignment. In the first set of experiments, target nucleic acids were copied and tagged by polymerase extension reactions using target-specific primers comprising a uniquely identifying index sequence. As used herein, this and similar molecular barcoding approaches are referred to as a “copy and tag reaction” or a “copy and ID-tag reaction”.

In this scheme, a polymerase extension primer was designed that comprises two regions (FIG. 7): a 3′ region comprising a target-specific priming sequence and a 5′ region comprising two different universal sequences (e.g., universal sequence A and universal sequence B) flanking a degenerate sequence (e.g., comprising 8 bp). Oligonucleotide primers were synthesized according to this scheme and used in polymerase extension reactions with a second oligonucleotide designed to stop the polymerase extension and thus “copy and tag” only the target region of interest:

polymerase extension primer Eg_e19_R_SOD_v03-01-bio (SEQ ID NO: 6)
bTAAUTAGTGGCTGACGGGTATCTCTCACCTTTNNNNNNNNCAGACATGAGAAAAGGTGGGC

polymerase extension blocker Eg_e19_SOD_SC-200_v1 (SEQ ID NO: 7)
C*A*ATTGTGAGATGGTGCCACATGCTGCam

In the sequences of the polymerase extension primer and polymerase extension blocker used in the polymerase extension reaction during the “copy and tag” procedure (SEQ ID NOs: 6 and 7 above), a “b” indicates a 5′-biotin modification, a “U” indicates a deoxyuridine modification, an “N” indicates a degenerate base position, a “*” indicates a phosphorothioate bond, and “am” indicates a 3′-amino modification.
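The size of the index space provided by the degenerate region can be worked out directly from the primer sequence. The following Python sketch is illustrative only; the variable names are hypothetical, the biotin label is omitted, and each degenerate position is assumed to be A, C, G, or T with equal probability.

```python
import re

# SEQ ID NO: 6 without the 5' biotin ("b")
EXTENSION_PRIMER = "TAAUTAGTGGCTGACGGGTATCTCTCACCTTTNNNNNNNNCAGACATGAGAAAAGGTGGGC"

index_region = re.search(r"N+", EXTENSION_PRIMER)
index_length = len(index_region.group())

# Each degenerate position can be A, C, G, or T.
possible_tags = 4 ** index_length

# Chance that two independently tagged molecules carry the same index,
# assuming all tags are equally likely.
pairwise_collision = 1 / possible_tags

target_specific = EXTENSION_PRIMER[index_region.end():]
print(index_length, possible_tags, pairwise_collision, target_specific)
```

An 8-base degenerate region therefore gives 65,536 possible index sequences; the sequence 3′ of the degenerate region in this primer is the flanking sequence that precedes the 3′ end.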

Polymerase extension reactions were performed using a commercially available high-fidelity polymerase master mix kit (KAPA HiFi HotStart PCR kit, KAPA Biosystems) to produce a reaction mixture comprising 1 pmol of polymerase extension primer (e.g., Eg_e19_R_SOD_v03-01-bio), 1 pmol of polymerase extension blocker (e.g., Eg_e19_SOD_SC-200_v1), and 100 ng of purified genomic DNA extracted from a human lung adenocarcinoma/non-small cell lung cancer cell line (Cell line NCI-H1975 available from ATCC under accession CRL-5908) in a 25-μl reaction volume. Polymerase extension reactions were incubated at 95° C. for 2 minutes, 98° C. for 30 seconds, 58° C. for 90 seconds, and 65° C. for 30 seconds. The dNTP and KAPA HiFi polymerase were added immediately after the completion of the 58° C. incubation step.

The polymerase extension reaction products were purified using bead-based purification (Ampure XP, Beckman Coulter) following the manufacturer protocol to remove polymerase extension primers, polymerase extension blockers, and other extension reaction components. Then, a solid phase capture-based purification using streptavidin-coated magnetic microspheres (Dynabeads, MyOne Streptavidin C1, Life Technologies) was used to isolate the polymerase extension reaction products from the genomic DNA template. After isolating the polymerase extension reaction products, a 2× binding and wash buffer (2 M NaCl, 20 mM Tris-HCl, 2 mM EDTA, pH 7.5) was added to the eluent from the bead purification at a 1:1 (v/v) ratio and incubated with the streptavidin beads at ambient (room) temperature for 15 minutes. The beads were separated from the supernatant using a magnet and the supernatant was removed. Next, the beads were washed twice using binding and wash buffer and resuspended in 25 μl of TE buffer (10 mM Tris-HCl, 0.1 mM EDTA, pH approximately 8). The beads were incubated with a solution of 0.1 M NaOH and 0.1 M NaCl for 1 minute to remove any traces of remaining genomic DNA. The beads were then separated from the supernatant using a magnet (the supernatant was discarded), washed twice using binding and wash buffer, and resuspended in 25 μl of TE buffer (10 mM Tris-HCl, 0.1 mM EDTA, pH approximately 8).

Finally, to release the bead-bound product, a uracil cleavage system (Uracil Glycosylase and Endonuclease VIII, Enzymatics) was used to cleave the bead-bound polymerase extension product at the deoxyuridine modification incorporated into the 5′ end of the polymerase extension product as a result of extension of the polymerase extension primer (see SEQ ID NO: 6). The supernatant comprising the polymerase extension product was purified using Ampure XP (Beckman Coulter) following the manufacturer protocol and eluted in 20 μl of TE buffer.

Amplification primers Uni_R_v2 and e19_F_v1 were designed, synthesized, and used to amplify the purified polymerase extension product to confirm generation of the copy and tag product as described schematically in FIG. 8. Amplification primers Uni_R_v2 and SC-240_COM_v1 were used to confirm that the polymerase extension blocker effectively blocked polymerase extension past the site at which the polymerase extension blocker binds to the template.

Uni_R_v2 (SEQ ID NO: 8) AGTGGCTGACGGGTATCTCTC
e19_F_v1 (SEQ ID NO: 9) TGCCAGTTAACGTCTTCCTTCT
SC-240_COM_v1 (SEQ ID NO: 10) ATCACTGGGCAGCATGTGG

Two amplification reactions were performed on the polymerase extension product. A first reaction comprised the primers Uni_R_v2 and e19_F_v1, which amplify both blocked (via polymerase extension blocker) and non-blocked polymerase extension products. A second reaction comprised the primers Uni_R_v2 and SC-240_COM_v1, which amplify only non-blocked polymerase extension product. The two types of reaction mixtures were produced using a commercially available amplification mix (KAPA 2G HS, KAPA Biosystems) and 0.25 μM of each primer (as indicated above for the two reactions) in a 25-μl final reaction volume. A 5-μl volume of purified polymerase extension product was used as template for each amplification reaction. The amplification reactions were thermocycled by incubating the reaction mixtures at 95° C. for 5 minutes, followed by 30 cycles of 98° C. for 20 seconds, 60° C. for 30 seconds, and 72° C. for 20 seconds. The amplification products were analyzed by gel electrophoresis (e.g., using an Agilent Bioanalyzer 2100 system and a High-Sensitivity DNA chip) to determine the fragment size distributions.

Data collected from fragment size analysis indicated that the amplification reaction comprising primers Uni_R_v2 and e19_F_v1 produced a product of the expected size. Furthermore, the data also indicated that the amplification reaction comprising primers Uni_R_v2 and SC-240_COM_v1 did not generate a detectable product, thus indicating that the polymerase extension blocker effectively stopped the polymerase reaction. Accordingly, the technology provides for precise control of the copy and tag reaction to produce products only from a target region of interest.

Example 14—Tagging with Adaptors Comprising an Index Sequence

Further, in a second set of experiments conducted during the development of embodiments described herein, target nucleic acids were copied and subsequently tagged by adaptor ligation using adaptors comprising a uniquely identifying index sequence. In this molecular barcoding scheme based on adaptor ligation (see, e.g., FIG. 9), a DNA adaptor was constructed using two oligonucleotides. The first oligonucleotide was designed to have a stretch of degenerate sequence (e.g., comprising 8 to 12 bases) flanked on both the 5′ end and the 3′ end by two different universal sequences (e.g., universal sequence A and universal sequence B; see FIG. 9). The second oligonucleotide was designed to comprise a universal sequence C (e.g., at the 5′ end) and a sequence (e.g., at the 3′ end) that is complementary to universal sequence B and that has an additional T at the 3′-terminal position. To produce the DNA adaptor, the two oligonucleotides were mixed in equal molar amounts, incubated at 95° C. for 5 minutes, and then cooled slowly to ambient (room) temperature to provide for efficient hybridization of the complementary portions of the two oligonucleotides (e.g., universal sequence B and its complementary sequence). Ligation of these adaptors to target DNA provides for the unique ‘ID-tagging’ of each individual target DNA molecule (e.g., each individual PCR amplicon), e.g., in a reaction comprising a molar excess of unique ID-tag sequence adaptors relative to the number of individual target molecules.

Experiments were conducted to test embodiments of this technology using the following oligonucleotides:

ST-adN10-phos-v1 (SEQ ID NO: 11)
pGTGGCTGACGGGTATCTCTCNNNNNNNNNNATCACCGACTGCCCATAGAGAGG

ST-ad-T-v1 (SEQ ID NO: 12)
GCACTGGATCACGTCATACCTACGAGAGATACCCGTCAGCCA*C*T

In the sequences of the two oligonucleotides used to form the adaptor (SEQ ID NOs: 11 and 12 above), a “p” indicates a 5′-phosphate modification, an “N” indicates a degenerate base position (e.g., the position can be A, C, G, or T), and a “*” indicates a phosphorothioate bond.
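The design of the two adaptor oligonucleotides can be checked against the scheme described above. The following Python sketch is illustrative only; the variable names are hypothetical and the 5′-phosphate and phosphorothioate marks are left out. It splits the first oligonucleotide (SEQ ID NO: 11) into its flanking and degenerate regions and tests whether the 3′ end of the second oligonucleotide (SEQ ID NO: 12) is the reverse complement of the 5′ flanking region followed by a single T.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an unmodified DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


OLIGO_1 = "GTGGCTGACGGGTATCTCTCNNNNNNNNNNATCACCGACTGCCCATAGAGAGG"  # SEQ ID NO: 11
OLIGO_2 = "GCACTGGATCACGTCATACCTACGAGAGATACCCGTCAGCCACT"           # SEQ ID NO: 12

five_flank, index, three_flank = re.fullmatch(
    r"([ACGT]+)(N+)([ACGT]+)", OLIGO_1
).groups()

duplex_region = reverse_complement(five_flank)
has_t_overhang = OLIGO_2.endswith(duplex_region + "T")
single_stranded_5prime = OLIGO_2[: len(OLIGO_2) - len(duplex_region) - 1]

print(len(five_flank), len(index), len(three_flank))
print(has_t_overhang, single_stranded_5prime, 4 ** len(index))
```

In this sketch the 20-base 5′ flanking region of the first oligonucleotide pairs with the second oligonucleotide, leaving a 3′-terminal T overhang for ligation to a dA-tailed amplicon, and the 10 degenerate positions allow 1,048,576 possible index sequences.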

As a first step, an amplification reaction was performed to amplify a 158-bp region in exon 18 (with flanking intron sequence) of the human EGFR gene using the following primers:

E_e18_f_v1p (SEQ ID NO: 13) pCCAGTGGAGAAGCTCCCAAC
E_e18_r_v1p (SEQ ID NO: 14) pCAGACCATGAGAGGCCCTG

In the sequences of the two EGFR primers (SEQ ID NOs: 13 and 14 above), a “p” indicates a 5′-phosphate modification. Reaction mixtures were produced using a commercially available PCR master mix kit (KAPA 2G HotStart PCR kit, KAPA Biosystems), 10 pmol each of the EGFR primers (SEQ ID NOs: 13 and 14), and 10 ng of purified genomic DNA extracted from a human lung adenocarcinoma/non-small cell lung cancer cell line (Cell line NCI-H1975 available from ATCC under accession CRL-5908) in a 25-μl reaction volume. The reaction mixtures were thermocycled by incubating at 95° C. for 2 minutes, followed by 23 cycles of 98° C. for 20 seconds, 63° C. for 30 seconds, and 68° C. for 20 seconds. After amplification, 1 μl of the reaction mix was used directly for DNA fragment size analysis using gel electrophoresis (e.g., Agilent 2100 Bioanalyzer and High Sensitivity DNA Assay Chip). Data collected from fragment analysis indicated that the amplification generated a product having the expected size of 158 bp.

Next, the amplification product was purified to remove unincorporated primers and amplification reaction components using a bead-based purification method (Ampure XP, Beckman Coulter) following the manufacturer's protocol.

After purification, an adaptor comprising an index sequence (e.g., as described above) was ligated to the amplicon. The amplicon produced by the amplification reaction above comprised a 5′ phosphate (e.g., from incorporation of the 5′-phosphate modified primers) and a 3′-dA-overhang (e.g., from use of a DNA polymerase that adds a non-templated A at the 3′-end of extension products). The ligation reaction was performed using a commercially available ligation kit (T4 DNA Ligase-Rapid, Enzymatics). In particular, a ligation reaction mixture was produced using the kit's “Rapid” ligation buffer, 25 pmol of adaptor, and approximately 0.25 pmol of the amplicon in a 50-μl reaction volume.

The ligation reaction mix was incubated at 25° C. for 10 minutes and then immediately purified twice using bead-based purification (Ampure XP, Beckman Coulter) following the manufacturer's protocol, except that the ratio of sample input volume to bead solution volume was changed from 1:1.8 to 1:1.

The purified ligated product was used as a template in a limited-cycle (e.g., 8-cycle) enrichment amplification (FIG. 10). The amplification reaction comprised primers designed to amplify the ligated product comprising the ‘ID-tag’ portion (e.g., 10 degenerate bases) and having an expected length of 249 bp:

PCR1 (SEQ ID NO: 15) CCTCTCTATGGGCAGTCGGTGAT
ST-PCR1-R-v1 (SEQ ID NO: 16) GCACTGGATCACGTCATACCTAC

The amplification was performed using a commercially available high-fidelity polymerase PCR master mix kit (KAPA HiFi HotStart PCR kit, KAPA Biosystems) to produce a reaction mixture comprising 0.25 μM of each primer and the purified adaptor-ligated product as template in a 25-μl reaction volume. The amplification reaction mixtures were thermocycled by incubating at 95° C. for 5 minutes, followed by 8 cycles of 98° C. for 20 seconds, 60° C. for 30 seconds, and 72° C. for 20 seconds. After amplification, 1 μl of the reaction mix was used directly for fragment size analysis by gel electrophoresis (Agilent 2100 Bioanalyzer and High Sensitivity DNA Assay Chip). Data collected from the fragment analysis indicated that the amplification produced an amplicon of the expected size from the adaptor-ligated product (e.g., a 249-bp amplicon comprising a portion corresponding to the EGFR amplicon of 158 bp produced above and a ligated adaptor).

Example 15—Circularization of Target Nucleic Acid

During the development of embodiments of the technology provided herein, experiments were conducted to evaluate a molecular technique based on intramolecular ligation (circularization) of target nucleic acid to orient different regions of the target nucleic acid in a specific arrangement. The method comprises circularizing a target nucleic acid, which places a known sequence (e.g., a universal priming sequence) adjacent to an unknown sequence (e.g., a region of interest to query, e.g., by sequencing) in a specific orientation (FIG. 11).

In these experiments, the circularization reactions were performed using a commercially available ssDNA ligase kit (CircLigase II, Epicentre-Illumina) following the manufacturer's protocol. The experiments tested synthetic input templates that were oligonucleotides (“ultramers”) having lengths of 100, 150, and 200 bases:

Ultramer-200bp (SEQ ID NO: 17) pGCAGCATGTGGCACCATCTCACAATTGCCAGTTAACGTCTTCCTTCTCTCTGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTCCATGGCTCTGAACCTCAGGCCCACCTTTTCTCATGTCTG
Ultramer-150bp (SEQ ID NO: 18) pGCAGCATGTGGCACCATCTCACAATTGCCAGTTAACGTCTTCCTTCTCTCTATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTCCATGGCTCTGAACCTCAGGCCCACCTTTTCTCATGTCTG
Ultramer-100bp (SEQ ID NO: 19) pGCAGCATGTGGCACCATCTCACAATTGCCAGTTAACGTCTTCCTTCTCTCTGATGTGAGTTTCTGCTTTGCTTCCTCAGGCCCACCTTTTCTCATGTCTG

In the sequences of the ultramers (SEQ ID NOs: 17, 18, and 19 above), a “p” indicates a 5′-phosphate modification.

After the circularization reaction, the products were treated with exonuclease I and III (NEB) for 30 minutes at 37° C. to remove non-circularized template. After exonuclease treatment, the exonucleases were inactivated by incubating at 80° C. for 10 minutes. To confirm circularization of the templates, primers were designed to amplify circle-specific amplification products (FIG. 12):

e19_F_v1 (SEQ ID NO: 20) TGCCAGTTAACGTCTTCCTTCT
e19_circ_v1 (SEQ ID NO: 21) G*A*TGGTGCCACATGCTGC

In the sequences of the circular template primers (SEQ ID NOs: 20 and 21 above), a “*” indicates a phosphorothioate bond.

Amplification reaction mixtures were produced using Taq-Gold (Abbott Molecular), 0.2 μM of each primer, and one of the three differently sized reaction products as template in a 25-μl reaction volume. The reaction mixtures were thermocycled by incubating at 95° C. for 5 minutes, followed by 38 cycles of 98° C. for 20 seconds, 60° C. for 30 seconds, and 68° C. for 30 seconds. After amplification, 10 μl of the reaction mix was used directly for DNA fragment size analysis by gel electrophoresis using pre-cast 2% agarose gels (E-Gel EX 2% Agarose Gel, Life Technologies). The data collected indicated that the amplification produced a product of the expected size from the circular templates, thus confirming the generation of circular nucleic acids from the three test ultramers. Furthermore, the absence of circle-specific products in negative controls comprising linear templates indicates that the primers produce circle-specific products.
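
The logic behind the circle-specific primers can be illustrated in silico. The sketch below is a simplified model written for this illustration and is not the procedure used in the experiments: the function names are hypothetical, primers are matched by exact string search, priming sites that span the ligation junction are not handled, and the length it prints is computed from the listed sequences rather than taken from the reported data. On the linear 100-base ultramer (SEQ ID NO: 19), the reverse primer site lies upstream of the forward primer site, so the primers point away from each other and no product is predicted. Joining the two ends places the sites in a productive orientation across the junction.

```python
# Simplified in-silico PCR on a linear versus a circularized template.
# Function names are hypothetical; modification markers ("p", "*") are omitted.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def revcomp(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def product_length(template: str, fwd: str, rev: str, circular: bool):
    """Return the predicted amplicon length, or None if no product is expected.

    fwd has the same sense as the template; rev anneals to the template.
    """
    fwd_start = template.find(fwd)
    rev_start = template.find(revcomp(rev))
    if fwd_start < 0 or rev_start < 0:
        return None
    rev_end = rev_start + len(rev)
    if circular:
        # Walk from the forward primer around the circle to the far end
        # of the reverse primer site.
        return (rev_end - fwd_start) % len(template) or len(template)
    # Linear: the reverse site must lie downstream of the forward site.
    return rev_end - fwd_start if rev_start >= fwd_start else None


ULTRAMER_100 = (
    "GCAGCATGTGGCACCATCTCACAATTGCCAGTTAACGTCTTCCTTCTCTC"
    "TGATGTGAGTTTCTGCTTTGCTTCCTCAGGCCCACCTTTTCTCATGTCTG"
)
FWD = "TGCCAGTTAACGTCTTCCTTCT"  # e19_F_v1 (SEQ ID NO: 20)
REV = "GATGGTGCCACATGCTGC"      # e19_circ_v1 (SEQ ID NO: 21)

linear_result = product_length(ULTRAMER_100, FWD, REV, circular=False)
circular_result = product_length(ULTRAMER_100, FWD, REV, circular=True)
print(len(ULTRAMER_100), linear_result, circular_result)
```

The same calculation can be applied to the 150- and 200-base ultramers by substituting their sequences for `ULTRAMER_100`.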

All publications and patents mentioned in the above specification are herein incorporated by reference in their entirety for all purposes. Various modifications and variations of the described compositions, methods, and uses of the technology will be apparent to those skilled in the art without departing from the scope and spirit of the technology as described. Although the technology has been described in connection with specific exemplary embodiments, it should be understood that the invention as claimed should not be unduly limited to such specific embodiments. Indeed, various modifications of the described modes for carrying out the invention that are obvious to those skilled in the art are intended to be within the scope of the following claims.

We claim:
 1. A kit for sequencing a target nucleic acid, the kit comprising: a) a nucleic acid ladder fragment library wherein said nucleic acid ladder fragment library comprises a plurality of nucleic acid fragments terminated by a 3′-O-propargyl nucleotide; and b) an adaptor oligonucleotide.
 2. The kit of claim 1, wherein said 3′-O-propargyl nucleotide comprises a first reactive group and said adaptor oligonucleotide comprises a second reactive group that is capable of being linked to said first reactive group by click chemistry.
 3. The kit of claim 1, wherein said nucleic acid fragment ladder comprises a plurality of nucleic acids having 3′ ends that differ by less than 20 nucleotides.
 4. The kit of claim 1, further comprising a copper-based click chemistry catalyst reagent.
 5. The kit of claim 1, further comprising a computer readable medium comprising instructions to instruct a computer to assemble short overlapping nucleotide sequences and to produce a consensus sequence.
 6. The kit of claim 5 wherein: i) each said short overlapping nucleotide sequence comprises less than 100 bases; ii) said short overlapping nucleotide sequences are tiled over a target nucleic acid comprising at least 100 bases; and iii) said short overlapping nucleotide sequences are offset from one another by 1-20 bases.
 7. The kit of claim 1, further comprising a polymerase.
 8. The kit of claim 1, further comprising a second adaptor oligonucleotide.
 9. The kit of claim 1, further comprising one or more compositions comprising a nucleotide or a mixture of nucleotides.
 10. The kit of claim 1, further comprising a ligase.
 11. The kit of claim 1, further comprising a sequencing primer.